Real-world data rarely comes clean, which is why data wrangling is an essential part of the data analysis process.
The dataset that we will be wrangling, analyzing and visualizing in this project is the Tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs, Brent." WeRateDogs has over 8 million followers and has received international media coverage.
The goal of the following project is to wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. Using Python and its libraries, we will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. In this Jupyter Notebook, we will document the wrangling efforts and proceed with the analysis to gain valuable insights.
Univariate Analysis
Multivariate Analysis
# Let's import the necessary libraries for our analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import StrMethodFormatter
import seaborn as sns
import requests
import tweepy
import json
from timeit import default_timer as timer
import re
import matplotlib.pylab as pylab
%matplotlib inline
# Let's set default parameters for our plots
params = {'axes.labelsize': 14,
          'axes.titlesize': 15,
          'xtick.labelsize': 12,
          'ytick.labelsize': 12}
pylab.rcParams.update(params)
# Let's import the WeRateDogs Twitter archive data provided by Udacity
twitter_archive = pd.read_csv('twitter_archive_enhanced.csv')
# Let's see the first 5 rows of the WeRateDogs Twitter archive data
twitter_archive.head()
# Let's download the Tweet image predictions provided by Udacity
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
r = requests.get(url)
with open('image_predictions.tsv', mode='wb') as file:
    file.write(r.content)
# Let's import the Tweet image predictions data
tweet_predictions = pd.read_csv('image_predictions.tsv', sep='\t')
# Let's check the first 5 rows of the Tweet image predictions data
tweet_predictions.head()
The following two cells were initially code cells. For security reasons, and to keep the data static for the analysis, they were converted to markdown cells after the 'tweet_json.txt' file was obtained on the first run.
# Let's create an API object that we can use to gather Twitter data.
# These are hidden to comply with Twitter's API terms and conditions
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
# Let's get the Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = twitter_archive.tweet_id.values
len(tweet_ids)
# Let's query Twitter's API for JSON data for each Tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Let's save each Tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
# A total of 2356 "Success" messages were printed out
end = timer()
print(end - start) # Result: 2395.341601019
# Let's check how many errors appeared during the Twitter API Query
len(fails_dict) # Result: 25
# Let's import the data from Twitter Query
# First, let's create a list to be converted into a pandas dataframe
tweet_list = []
with open('tweet_json.txt', encoding='utf-8') as json_file:
    # We'll use a for loop to convert the txt file line by line into JSON objects
    for line in json_file:
        data = json.loads(line)
        tweet_id = data['id_str']
        display_text_range = data['display_text_range']
        in_reply_to_status_id = data['in_reply_to_status_id_str']
        in_reply_to_user_id = data['in_reply_to_user_id_str']
        in_reply_to_screen_name = data['in_reply_to_screen_name']
        is_quote_status = data['is_quote_status']
        retweet_count = data['retweet_count']
        favorite_count = data['favorite_count']
        favorited = data['favorited']
        retweeted = data['retweeted']
        language = data['lang']
        screen_name = data['user']['screen_name']
        # We'll create a dictionary and append it to tweet_list
        tweet_dict = {'tweet_id': tweet_id, 'display_text_range': display_text_range,
                      'in_reply_to_status_id': in_reply_to_status_id, 'in_reply_to_user_id': in_reply_to_user_id,
                      'in_reply_to_screen_name': in_reply_to_screen_name, 'is_quote_status': is_quote_status,
                      'retweet_count': retweet_count, 'favorite_count': favorite_count, 'favorited': favorited,
                      'retweeted': retweeted, 'language': language, 'screen_name': screen_name}
        tweet_list.append(tweet_dict)
# Let's convert tweet_list to dataframe
tweets_info = pd.DataFrame(tweet_list)
# Let's check the first 5 rows of tweets_info dataframe
tweets_info.head()
Before moving forward, let's describe each of the three dataframes gathered:
twitter_archive has the following columns:
tweet_predictions has the following columns:
tweets_info has the following columns:
The Dogtionary explains the various stages of dog: doggo, pupper, puppo, and floof(er) (via the #WeRateDogs book on Amazon)
# Let's visually assess the twitter_archive dataframe
twitter_archive.sample(5)
# Let's check some values of row 1001
print(twitter_archive.iloc[1001,].text)
print(twitter_archive.iloc[1001,].rating_numerator)
print(twitter_archive.iloc[1001,].rating_denominator)
print(twitter_archive.iloc[1001,]['name'])
print(twitter_archive.iloc[1001,].doggo)
print(twitter_archive.iloc[1001,].floofer)
print(twitter_archive.iloc[1001,].pupper)
print(twitter_archive.iloc[1001,].puppo)
# Let's look at the summary info of twitter_archive
twitter_archive.info()
There are a total of 2356 rows and 17 columns in the twitter_archive table. It has 6 columns with null values.
# Let's analyze the quantitative columns of twitter_archive
twitter_archive.describe()
# Let's check the Tweets with abnormal rating numerators in twitter_archive
twitter_archive[twitter_archive['rating_numerator'] > 200]
# Let's check if this row has a correct rating numerator
print(twitter_archive[twitter_archive['rating_numerator'] > 200].iloc[2,].text)
print(twitter_archive[twitter_archive['rating_numerator'] > 200].iloc[2,].tweet_id)
This row doesn't have a correct rating numerator.
# Let's check the Tweets with a zero rating denominator in twitter_archive
twitter_archive[twitter_archive['rating_denominator'] == 0.0]
# Let's check if this row has a correct rating denominator
print(twitter_archive[twitter_archive['rating_denominator'] == 0.0].iloc[0,].text)
print(twitter_archive[twitter_archive['rating_denominator'] == 0.0].iloc[0,].tweet_id)
This row doesn't have a correct rating denominator.
# Let's check the Tweets with a zero rating numerator in twitter_archive
twitter_archive[twitter_archive['rating_numerator'] == 0.0]
# Let's check if this row has a correct rating numerator
print(twitter_archive[twitter_archive['rating_numerator'] == 0.0].iloc[1,].text)
This row does have a correct rating numerator.
# Checking the value counts of the "name" column in twitter_archive
twitter_archive.name.value_counts()
# Let's check the rows with "None" name in twitter_archive
twitter_archive[twitter_archive['name']== 'None']
# Let's check if the 35th row in fact does not have a dog name in twitter_archive
print(twitter_archive.iloc[35,].tweet_id)
print(twitter_archive.iloc[35,].text)
This Tweet's text does have a dog name.
# Let's check the count of values of "doggo" column in twitter_archive
twitter_archive.doggo.value_counts()
# Let's check if any rows have "None" in the "doggo" column even though the word doggo appears in the text
twitter_archive[twitter_archive['doggo']=="None"]['text'].str.contains('doggo', case=False).sum()
# Let's check if any rows have "None" in the "pupper" column even though the word pupper appears in the text
twitter_archive[twitter_archive['pupper']=="None"]['text'].str.contains('pupper', case=False).sum()
# Let's check if any rows have "None" in the "puppo" column even though the word puppo appears in the text
twitter_archive[twitter_archive['puppo']=="None"]['text'].str.contains('puppo', case=False).sum()
# Let's check if any rows have "None" in the "floofer" column even though the word floofer appears in the text
twitter_archive[twitter_archive['floofer']=="None"]['text'].str.contains('floofer', case=False).sum()
# Let's check if any rows have "None" in the "floofer" column even though the word floof appears in the text
twitter_archive[twitter_archive['floofer']=="None"]['text'].str.contains('floof', case=False).sum()
# Let's check this row in twitter_archive
print(twitter_archive.iloc[1862].text)
print(twitter_archive.iloc[1862].pupper)
print(twitter_archive.iloc[1862].tweet_id)
This row has "None" in the "pupper" column even though the Tweet's text does contain "puppers".
# Let's check how many Tweets are retweets in twitter_archive
twitter_archive.retweeted_status_user_id.value_counts().sum()
# Let's check how many Tweets are replies in twitter_archive
twitter_archive.in_reply_to_status_id.value_counts().sum()
# Let's check this row in twitter_archive
print(twitter_archive[twitter_archive['tweet_id'] == 786709082849828864]['text'].values)
print(twitter_archive[twitter_archive['tweet_id'] == 786709082849828864]['rating_numerator'])
We can see that this row's rating numerator was extracted incorrectly as the integer 75, when the actual rating numerator is the decimal 9.75.
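As a sanity check on the fix we'll apply later, here is a minimal sketch (using a shortened, hypothetical version of the tweet text) of a regex that captures a decimal numerator instead of just its integer tail:

```python
import re

# Hypothetical, shortened tweet text; the real tweet ends in "9.75/10"
text = "This is Logan. He's quite magical. 9.75/10"
# Allow an optional fractional part in the numerator
match = re.search(r"(\d+(?:\.\d+)?)/(\d+)", text)
print(match.group(1))  # -> "9.75", not "75"
print(match.group(2))  # -> "10"
```

The key difference from a plain `\d+/\d+` pattern is the non-capturing optional group `(?:\.\d+)?`, which lets the numerator consume the decimal point and fractional digits.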
# Let's check if there are duplicated values in twitter_archive table
twitter_archive[twitter_archive.duplicated()]
There are no duplicated rows in the twitter_archive table.
# Let's check the null values in twitter_archive table
twitter_archive[twitter_archive['in_reply_to_status_id'].isnull()].head()
# Let's check a summary of tweet_predictions table
tweet_predictions.info()
There are a total of 2075 rows and 12 columns in the tweet_predictions table, and it doesn't have any null values.
# Let's check if there are duplicated rows in tweet_predictions table
tweet_predictions[tweet_predictions.duplicated()]
There are no duplicated rows in the tweet_predictions table.
# Let's check a summary of tweets_info table
tweets_info.info()
There are a total of 2331 rows and 12 columns in the tweets_info table. It has 3 columns with null values.
# Let's check if there are duplicated rows in tweets_info table
tweets_info[tweets_info.astype(str).duplicated()]
There are no duplicated rows in the tweets_info table.
# Let's check the null values in tweets_info table
tweets_info[tweets_info['in_reply_to_screen_name'].isnull()].head()
# Let's check the duplicated columns in the three tables
all_columns = pd.Series(list(twitter_archive) + list(tweet_predictions) + list(tweets_info))
all_columns[all_columns.duplicated(keep=False)]
Quality Issues
twitter_archive
tweet_predictions
tweets_info
Tidiness Issues
twitter_archive, tweets_info and tweet_predictions are separate tables when they should be a single table. There are three duplicated columns across those tables ("tweet_id", "in_reply_to_status_id", "in_reply_to_user_id"), so the duplicates should be dropped during the join.
# Let's create a copy of each dataframe
twitter_archive_clean = twitter_archive.copy()
tweet_prediction_clean = tweet_predictions.copy()
tweet_info_clean = tweets_info.copy()
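The single-table fix described under the tidiness issues can be sketched with toy dataframes. The column names mirror the real tables, but the values are made up; the real merges are performed in Cleaning Step #4.

```python
import pandas as pd

# Toy stand-ins for the three tables (made-up values)
archive = pd.DataFrame({'tweet_id': ['1', '2'], 'text': ['a', 'b'],
                        'in_reply_to_status_id': [None, None]})
info = pd.DataFrame({'tweet_id': ['1', '2'], 'retweet_count': [5, 7],
                     'in_reply_to_status_id': [None, None]})
preds = pd.DataFrame({'tweet_id': ['1', '2'], 'p1': ['pug', 'corgi']})

# Drop one copy of the duplicated column, then join everything on "tweet_id"
merged = (archive.drop(columns=['in_reply_to_status_id'])
                 .merge(info, on='tweet_id', how='inner')
                 .merge(preds, on='tweet_id', how='inner'))
print(merged.columns.tolist())
# -> ['tweet_id', 'text', 'retweet_count', 'in_reply_to_status_id', 'p1']
```

Dropping one copy of each duplicated column before merging avoids the `_x`/`_y` suffixes pandas would otherwise append to the colliding names.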
First, we will address the missing data:
#1: In twitter_archive:
In the twitter_archive_clean table, identify the names in the "name" column that are lowercase and convert them to "None" values.
# Let's identify the names in the "name" column that are lowercase
# and convert them to "None" values
mask = twitter_archive_clean.name.str.islower()
column_name = 'name'
twitter_archive_clean.loc[mask, column_name] = 'None'
# Let's identify the names with less than 3 characters
name_list = twitter_archive_clean.name.value_counts().index
name_too_short = []
for i in name_list:
    if len(i) < 3:
        name_too_short.append(i)
        print(i)
# Let's check the names with less than 3 characters
twitter_archive_clean[twitter_archive_clean.name.isin(name_too_short)]
# Fixing the names that were incorrect
twitter_archive_clean.at[775,'name'] = "O'Malley"
twitter_archive_clean.at[820,'name'] = "Al Cabone"
# Let's drop this row since it has multiple dogs
twitter_archive_clean.drop(876, inplace=True)
# Let's reset the indexes
twitter_archive_clean.reset_index(drop=True, inplace=True)
# Checking how many "None" values are in the "name" column
twitter_archive_clean.name.value_counts().head()
# Let's identify patterns in “text” column to find names within the rows with "None" values in the "name" column
twitter_archive_clean[twitter_archive_clean['name'] == 'None'].text.values
Patterns identified for single dog:
Patterns identified for multiple dogs:
# Let's identify names with the patterns identified for single dog
pattern = r"(?:name is |[Tt]his is |[Hh]ere we have |named |RIP |featuring |[\d]+/[\d]+ for )([A-Z][a-z']+)"
names = twitter_archive_clean[twitter_archive_clean['name']=='None']['text'].str.extract(pat=pattern)
names.fillna('None', inplace=True)
# Let's check how many names were identified
names[0].value_counts()
# Let's place the new names in new_df (a transition dataframe)
new_df = pd.merge(twitter_archive_clean, names, left_index=True, right_index=True, how='left')
new_df.fillna('None', inplace=True)
new_df.head()
# Let's check the placement was successful.
# "None" values of "0" column should be 2356 - 1 - 854 + 810 = 2311
new_df[0].value_counts().head()
# Let's insert the names found into the "None" values of the "name" column
new_df['name_fixed'] = np.where(new_df['name']=='None',new_df[0],new_df['name'])
# Let's check if it was successful.
# "None" values should now be 810 as in "names" table
new_df['name_fixed'].value_counts().head()
# Let's drop the "name" column
new_df.drop(columns=['name', 0],inplace=True)
# Let's check the dropping was successful
new_df.head()
# Let's get the list of names we have so far
name_list = new_df.name_fixed.values
# Let's append to the list the names we could find during the pattern recognition process
# but didn't follow any specific pattern
name_list = np.append(name_list,['BOOMER','CHARLIE', 'Atlas', 'Boomer', 'Cooper', 'Grace', 'Jack', 'Meeko', 'Pippa', 'Pipsy', 'Rand Paul', 'Teddy', 'Toby', 'Tristan', 'Max', 'Martha', 'Teagan'])
# Let's keep only unique names
name_list = np.unique(name_list)
name_list
# Let's define a function to look for names in our name_list
def extract_names(x):
    name = 'None'
    for i in name_list:
        if (i + " " in x) or (i + "." in x) or (i + "," in x):
            name = i
    return name
# Within the rows whose "name_fixed" value is "None", let's check whether the text contains any name from name_list
new_df1 = new_df[new_df['name_fixed']=='None']['text'].map(extract_names)
new_df1 = pd.DataFrame(new_df1)
new_df1.rename(columns={'text': 'dog_name'}, inplace=True)
new_df1.dog_name.value_counts()
# Let's place the new names in new_df2 (a transition dataframe)
new_df2 = pd.merge(new_df, new_df1, left_index=True, right_index=True, how='left')
new_df2.fillna('None', inplace=True)
new_df2.head()
# Let's insert the names found into the "None" values of the "name_fixed" column
new_df2['name'] = np.where(new_df2['name_fixed']=='None', new_df2['dog_name'], new_df2['name_fixed'])
# Let's normalize the capitalization of the names (e.g. "BOOMER" -> "Boomer")
new_df2['name'] = new_df2['name'].str.title()
# Let's check if it was successful. "None" values should now be 739 as in "new_df1" table
new_df2['name'].value_counts().head()
# Let's drop the columns "name_fixed" and "dog_name"
new_df2.drop(columns=['name_fixed', 'dog_name'], inplace=True)
# Let's update our twitter_archive_clean with a copy of new_df2
twitter_archive_clean = new_df2.copy()
# Checking the two names were fixed correctly
print(twitter_archive_clean.at[775,'name']) # should be O'Malley
print(twitter_archive_clean.at[820,'name']) # should be Al Cabone
# Checking Tweet_id 885518971528720385 name is "Howard"
twitter_archive_clean[twitter_archive_clean['tweet_id'] == 885518971528720385]['name']
# Let's check our twitter_archive_clean dataframe has 2355 rows now
twitter_archive_clean.info()
# Let's check our twitter_archive_clean table has now 739 "None" values in name column
twitter_archive_clean.name.value_counts()
#2: In twitter_archive: Missing values in "expanded_urls" (there are only 2297).
Update the "expanded_urls" column using f"https://twitter.com/dog_rates/status/{tweet_id}" in the twitter_archive_clean table
# Let's update the "expanded_urls"
twitter_archive_clean['expanded_urls'] = twitter_archive_clean['tweet_id'].apply(lambda x: f"https://twitter.com/dog_rates/status/{x}")
# Let's check our updated table
twitter_archive_clean.head()
# Let's check the value counts of "expanded_urls" column
# There should not be any "None" value
twitter_archive_clean.expanded_urls.value_counts().head()
# Let's check the sum of value counts of "expanded_urls" column
# This should be 2355
twitter_archive_clean.expanded_urls.value_counts().sum()
# Let's check one value of the "expanded_urls" column
twitter_archive_clean.expanded_urls[0]
Next, we will tackle the completeness and tidiness issues together.
#3: In twitter_archive:
In the twitter_archive_clean table, extract the dog stages from the "text" column and input those values into a new column called "dog_stage".
# Let's create patterns for every dog stage category
p1 = r"(?:[\s, \(\)\[\]#^&*+=}{:|\";?><~!']| -|^|^-|\.\.)(floof)s?(?:- |[\s ,?!:\)\(\]\[{}'~^&*+=|\";><#]|$|\.$|\. |\.\.|-$)"
p2 = r"(?:[\s, \(\)\[\]#^&*+=}{:|\";?><~!']| -|^|^-|\.\.)(doggo)s?(?:- |[\s ,?!:\)\(\]\[{}'~^&*+=|\";><#]|$|\.$|\. |\.\.|-$)"
p3 = r"(?:[\s, \(\)\[\]#^&*+=}{:|\";?><~!']| -|^|^-|\.\.)(pupper)s?(?:- |[\s ,?!:\)\(\]\[{}'~^&*+=|\";><#]|$|\.$|\. |\.\.|-$)"
p4 = r"(?:[\s, \(\)\[\]#^&*+=}{:|\";?><~!']| -|^|^-|\.\.)(puppo)s?(?:- |[\s ,?!:\)\(\]\[{}'~^&*+=|\";><#]|$|\.$|\. |\.\.|-$)"
p5 = r"(?:[\s, \(\)\[\]#^&*+=}{:|\";?><~!']| -|^|^-|\.\.)(floofer)s?(?:- |[\s ,?!:\)\(\]\[{}'~^&*+=|\";><#]|$|\.$|\. |\.\.|-$)"
# Let's define a function to extract dog stages
def extract_stages(x):
    x = x.lower()
    patterns = [p1, p2, p3, p4, p5]
    name = 'None'
    for p in patterns:
        match = re.search(p, x)
        if match is not None:
            if match.group(1) == 'floof':
                # We count "floof" matches in the floofer category
                if name == 'None':
                    name = 'floofer'
                else:
                    name = ",".join([name, 'floofer'])
            else:
                if name == 'None':
                    name = match.group(1)
                else:
                    name = ",".join([name, match.group(1)])
    return name
# Let's create a new column called "dog_stage" using that function
twitter_archive_clean['dog_stage'] = twitter_archive_clean['text'].map(extract_stages)
# Let's change "dog_stage" column to a category type.
twitter_archive_clean.dog_stage = twitter_archive_clean.dog_stage.astype('category')
# Let's drop the columns 'doggo', 'floofer', 'pupper', 'puppo'
twitter_archive_clean.drop(columns=['doggo', 'floofer', 'pupper', 'puppo'], inplace=True)
# Let's check our updated twitter_archive_clean table
twitter_archive_clean.head()
# Let's check our "dog_stage" column has 2355 rows
twitter_archive_clean['dog_stage'].value_counts().sum()
# Let's check the counts of values in "dog_stage" column
twitter_archive_clean['dog_stage'].value_counts()
# Let's check the Tweet_id 675432746517426176 has "pupper" in "dog_stage" column
twitter_archive_clean[twitter_archive_clean['tweet_id'] == 675432746517426176]['dog_stage']
# Let's check the "dog_stage" column is a category type
twitter_archive_clean.dtypes
#4:
tweet_predictions: there are only 2075 rows instead of 2356 rows. Probably because not all Tweets contained images.
tweets_info: there are only 2331 rows instead of 2356 rows. Probably because those Tweets were deleted.
twitter_archive, tweets_info and tweet_predictions are separate tables when they should be a single table. There are three duplicated columns across those tables ("tweet_id", "in_reply_to_status_id", "in_reply_to_user_id"), so the duplicates should be dropped during the join. First, merge twitter_archive_clean with tweet_info_clean using the "tweet_id" column; then merge the result with tweet_prediction_clean using the "tweet_id" column.
# Let's change the datatype of the "tweet_id" column of twitter_archive_clean to string.
twitter_archive_clean.tweet_id = twitter_archive_clean.tweet_id.astype(str)
# Let's check it was successful
twitter_archive_clean.dtypes
# Let's change the datatype of "tweet_id" column of tweet_info_clean to string.
tweet_info_clean.tweet_id = tweet_info_clean.tweet_id.astype(str)
# Let's check it was successful
tweet_info_clean.dtypes
# Let's change the datatype of "tweet_id" column of tweet_prediction_clean to string.
tweet_prediction_clean.tweet_id = tweet_prediction_clean.tweet_id.astype(str)
# Let's check it was successful
tweet_prediction_clean.dtypes
# Let's merge twitter_archive_clean and tweet_info_clean
table = pd.merge(twitter_archive_clean, tweet_info_clean, on='tweet_id', how='inner')
# Let's check the merge was successful
table.head()
# Let's check the shape of our merged table. It should be 2330 rows * 25 columns
table.shape
# Let's merge the merged table to tweet_prediction_clean
df_clean = pd.merge(table, tweet_prediction_clean, on='tweet_id', how='inner')
# Let's drop the duplicated columns
df_clean.drop(columns=['in_reply_to_status_id_x', 'in_reply_to_user_id_x'], inplace=True)
# Let's rename the columns
df_clean.rename(columns={'in_reply_to_status_id_y': 'in_reply_to_status_id', 'in_reply_to_user_id_y': 'in_reply_to_user_id'}, inplace=True)
# Let's reset the indexes
df_clean.reset_index(drop=True, inplace=True)
# Let's check if the merge was successful
df_clean.head()
# Let's check the summary of our new df_clean table
df_clean.info()
# Let's obtain the tweet_id list in table and tweet_prediction_clean
table_id_list = table.tweet_id.values
tp_id_list = tweet_prediction_clean.tweet_id.values
print(len(table_id_list)) #should be 2330
print(len(tp_id_list)) #should be 2075
# Let's check how many tweet_id of tweet_prediction_clean are not in table
not_matched = []
for i in tp_id_list:
    if i not in table_id_list:
        not_matched.append(i)
len(not_matched)
# Let's calculate how many rows would result from the merge of table and tweet_prediction_clean
len(tp_id_list) - len(not_matched)
# Let's calculate how many columns would result from the merge of table and tweet_prediction_clean
all_columns = pd.Series(list(twitter_archive) + list(tweet_predictions) + list(tweets_info))
n_columns = len(np.unique(all_columns)) - 4 + 1 #we removed 4 columns and added 1 in Cleaning Step #3
n_columns
# Let's check the shape of our new df_clean table. It should be 2058 rows and 34 columns
df_clean.shape
Finally, we'll clean up the rest of quality issues. It's worth noting that from now on we will only use the df_clean table for the following cleaning and analyses.
#5: The "source" column includes HTML code instead of the actual source of the Tweet (phone, web client, etc.).
Extract the real source from the HTML and replace the "source" column in the df_clean table.
# Let's check the structure of a couple of sources to determine our regex pattern
df_clean.source[0]
# Let's check the structure of a couple of sources to determine our regex pattern
df_clean.source[5]
# Let's extract the real source of the column "source" and replace the "source" column
pattern = r">([A-Za-z ]+)<"
df_clean['source'] = df_clean['source'].str.extract(pat=pattern)
# Let's verify the "source" column was correctly updated
df_clean['source'].value_counts()
# Let's check the updated table
df_clean.head(2)
#6: There are Tweets that are retweets or replies.
# Let's check how many rows are NOT retweets ("None" values)
df_clean.retweeted_status_id.value_counts().head()
# Let's identify the rows that are retweets
# They should be 2058 - 1986 = 72
df_clean[df_clean['retweeted_status_id'] != 'None']
# Let's get the indexes of the rows that are retweets
idx_to_drop = df_clean[df_clean['retweeted_status_id'] != 'None'].index.values
# Let's drop the rows that are retweets
df_clean.drop(idx_to_drop,inplace=True)
# Let's reset the indexes
df_clean.reset_index(drop=True, inplace=True)
# Let's check how many rows are replies
df_clean.in_reply_to_screen_name.value_counts()
# Let's identify the rows that are replies. They should be 23.
df_clean[df_clean['in_reply_to_screen_name'] == 'dog_rates']
# Let's drop the rows that are replies
idx_to_drop = df_clean[df_clean['in_reply_to_screen_name'] == 'dog_rates'].index.values
df_clean.drop(idx_to_drop,inplace=True)
# Let's reset the indexes
df_clean.reset_index(drop=True, inplace=True)
# Let's check there are only "None" values in "retweeted_status_id"
# This will verify that the retweets are gone
df_clean.retweeted_status_id.value_counts()
# Let's fill the null values with "None" to facilitate the analysis
df_clean.fillna('None', inplace=True)
# Let's check there are only "None" values in "in_reply_to_screen_name"
# This will verify that the replies are gone
df_clean.in_reply_to_screen_name.value_counts()
# Let's check the new shape of our updated table
df_clean.shape
#7:
# Let's identify Tweets with multiple dogs
pat = "their names are|they are named|they are"
# Let's check how many Tweets have multiple dogs in "text" column
print(df_clean.text.str.lower().str.contains(pat).sum())
# Let's check those rows with multiple dogs
df_clean[df_clean.text.str.lower().str.contains(pat)]
# Let's drop those rows with the exception of tweet_id 725729321944506368 that only has one dog
idx = df_clean[df_clean.text.str.lower().str.contains(pat)].index.values
idr = df_clean[df_clean['tweet_id']=='725729321944506368'].index.values
idx = np.delete(idx, np.where(idx==idr),axis=0)
df_clean.drop(idx, inplace=True)
# Let's reset the indexes
df_clean.reset_index(drop=True, inplace=True)
# Let's check that tweet_id 725729321944506368 is the only one left
df_clean[df_clean.text.str.lower().str.contains(pat)]
# Let's extract numerators and denominators using regex, and input them to transition columns
p_num = re.compile(r"([\d]+\.?[\d]*)\/(?:[\d]+\.?[\d]+|\d+)")
p_den = re.compile(r"(?:[\d]+\.?[\d]*)\/([\d]+\.?[\d]+|\d+)")
df_clean['num'] = df_clean['text'].apply(lambda x: ','.join(p_num.findall(x)))
df_clean['den'] = df_clean['text'].apply(lambda x: ','.join(p_den.findall(x)))
# Let's check each row got a match for numerator
# This should be 1963 - 15 = 1948
df_clean.num.value_counts().sum()
# Let's check each row got a match for denominator
# This should be 1963 - 15 = 1948
df_clean.den.value_counts().sum()
Let's analyze the rows with abnormal values in the numerator and denominator. First, let's check whether there is a single-digit denominator.
# Let's count values in the denominator column
df_clean.den.value_counts()
# Let's see the text of this row with denominator "7"
df_clean[df_clean['den']=='7'].text.values
# Since this row doesn't have a rating (it is just referring to a time of day), we will drop it
idx = df_clean[df_clean['den']=='7'].index.values
df_clean.drop(idx,inplace=True)
# Let's reset the indexes
df_clean.reset_index(drop=True, inplace=True)
# Let's check the row is gone
df_clean[df_clean['den']=='7']
# Let's see the text of this row with denominator "2,10"
df_clean[df_clean['den']=='2,10'].text.values
# Let's fix the rating's numerator and denominator
idx = df_clean[df_clean['den']=='2,10'].index.values
df_clean.at[idx,'num'] = '9'
df_clean.at[idx,'den'] = '10'
# Let's check the fix was successful
df_clean.iloc[idx]
Now, let's analyze the rows with two or more denominators.
# Let's count the Tweets with two or more denominators
print(df_clean['den'].str.contains(',10').sum())
# Let's look at the rows with two or more denominators
df_clean[df_clean['den'].str.contains(',10')]
# Let's drop those rows with multiple ratings, with the exception of
# tweet_id 740373189193256964, 722974582966214656, 716439118184652801, 682962037429899265 and 673295268553605120
# because they only have one real rating
idok = df_clean.query('tweet_id == "740373189193256964" or tweet_id == "722974582966214656" or tweet_id == "716439118184652801" or tweet_id == "682962037429899265" or tweet_id == "673295268553605120"').index.values
idx = df_clean[df_clean['den'].str.contains(',10')].index.values
idx = [i for i in idx if i not in idok]
df_clean.drop(idx,inplace=True)
# Let's reset the indexes
df_clean.reset_index(drop=True, inplace=True)
# This should only give 5 rows
# tweet_id 740373189193256964, 722974582966214656, 716439118184652801, 682962037429899265 and 673295268553605120
df_clean[df_clean['den'].str.contains(',10')]
# Now, let's fix the rating's numerator and denominator of these 5 rows
# tweet_id 740373189193256964
print(df_clean[df_clean['tweet_id']=="740373189193256964"].text.values)
idx = df_clean[df_clean['tweet_id']=="740373189193256964"].index.values
df_clean.at[idx,'num'] = '14'
df_clean.at[idx,'den'] = '10'
# tweet_id 722974582966214656
print(df_clean[df_clean['tweet_id']=="722974582966214656"].text.values)
idx = df_clean[df_clean['tweet_id']=="722974582966214656"].index.values
df_clean.at[idx,'num'] = '13'
df_clean.at[idx,'den'] = '10'
# tweet_id 716439118184652801
print(df_clean[df_clean['tweet_id']=="716439118184652801"].text.values)
idx = df_clean[df_clean['tweet_id']=="716439118184652801"].index.values
df_clean.at[idx,'num'] = '11'
df_clean.at[idx,'den'] = '10'
# tweet_id 682962037429899265
print(df_clean[df_clean['tweet_id']=="682962037429899265"].text.values)
idx = df_clean[df_clean['tweet_id']=="682962037429899265"].index.values
df_clean.at[idx,'num'] = '10'
df_clean.at[idx,'den'] = '10'
# tweet_id 673295268553605120
print(df_clean[df_clean['tweet_id']=="673295268553605120"].text.values)
idx = df_clean[df_clean['tweet_id']=="673295268553605120"].index.values
df_clean.at[idx,'num'] = '8'
df_clean.at[idx,'den'] = '10'
# Let's check that no multiple denominators are left
df_clean.den.value_counts()
# Let's check that no multiple numerators are left
df_clean.num.value_counts()
There are no more rows with multiple numerators.
We also notice there aren't denominators with zero value, but there is one numerator with zero value.
# Let's convert numerator column to float to ease the analysis
df_clean.num = df_clean.num.astype(float)
# Let's perform a brief summary of numerator column
df_clean.num.describe()
# Let's check the Tweets with numerator lower than 3
df_clean.query('num < 3').text.values
We notice most of these Tweets (numerator lower than 3) are not about dogs (in fact, they are about other animals), even though they have names. However, the first row doesn't even have an animal in it (it's a complaint about plagiarism), so we will drop it.
# Let's drop row with numerator equal to 0
idx = df_clean.query('num == 0').index.values
df_clean.drop(idx,inplace=True)
# Let's reset the indexes
df_clean.reset_index(drop=True, inplace=True)
# Let's check the row with numerator equal to 0 is gone
df_clean.query('num == 0')
# Let's drop "rating_numerator" and "rating_denominator" column
df_clean.drop(columns=['rating_denominator', 'rating_numerator'], inplace=True)
# Let's rename transition columns to "rating_numerator" and "rating_denominator".
df_clean.rename(columns={'num': 'rating_numerator', 'den': 'rating_denominator'}, inplace=True)
# Let's check there is no row with null numerator
df_clean[df_clean.rating_numerator.isnull()]
# Let's check value counts of "rating_numerator"
df_clean['rating_numerator'].value_counts()
# Let's check that "rating_numerator" doesn't have null values.
# This should be 1963 - 15 - 1 - 20 - 1 = 1926
df_clean['rating_numerator'].value_counts().sum()
# Let's check that "rating_denominator" doesn't have "None" values
df_clean['rating_denominator'].value_counts()
# Let's check that "rating_denominator" doesn't have null values. This should be 1926
df_clean['rating_denominator'].value_counts().sum()
# Let's sample test one row to check the values are correct
df_clean.iloc[1001].text
# This should be 7
df_clean.iloc[1001].rating_numerator
# This should be 10
df_clean.iloc[1001].rating_denominator
# Let's check Tweet_id 835246439529840640 "rating_numerator" and "rating_denominator"
print(df_clean[df_clean['tweet_id']=='835246439529840640'].text.values)
print(df_clean[df_clean['tweet_id']=='835246439529840640']['rating_numerator'])
print(df_clean[df_clean['tweet_id']=='835246439529840640']['rating_denominator'])
This Tweet was deleted because it was a Reply, so this query is empty.
# Let's check Tweet_id 786709082849828864 "rating_numerator" and "rating_denominator"
print(df_clean[df_clean['tweet_id']=='786709082849828864'].text.values)
print(df_clean[df_clean['tweet_id']=='786709082849828864']['rating_numerator'])
print(df_clean[df_clean['tweet_id']=='786709082849828864']['rating_denominator'])
#8: Verify column datatypes in df_clean, and if they are not the expected datatype, change them as follows:
# Let's check the current datatypes
df_clean.dtypes
"tweet_id", "in_reply_to_status_id", "in_reply_to_user_id", "retweeted_status_id", "retweeted_status_user_id" and "in_reply_to_screen_name" are already string types, so we will not change them.
# Let's convert "timestamp" column to datetime type
df_clean['timestamp'] = pd.to_datetime(df_clean['timestamp'])
# Let's check the values in "retweeted_status_timestamp" column
df_clean['retweeted_status_timestamp'].value_counts()
Since the "retweeted_status_timestamp" column only has "None" values, we will keep it as a string type.
Also, the "rating_numerator" column is already a float type, so we will not convert it.
# Let's convert "rating_denominator" column to float type
df_clean['rating_denominator'] = df_clean['rating_denominator'].astype(float)
# Let's check the updated datatypes
df_clean.dtypes
#9: Missing values in "in_reply_to_status_id", "in_reply_to_user_id", "retweeted_status_id", "retweeted_status_user_id", "retweeted_status_timestamp" should be "None" instead of NaN because they are strings.
# Let's check that there are only "None" values in 'in_reply_to_status_id' column
df_clean['in_reply_to_status_id'].value_counts()
# Let's check that there are only "None" values in 'in_reply_to_user_id' column
df_clean['in_reply_to_user_id'].value_counts()
# Let's check that there are only "None" values in 'retweeted_status_id' column
df_clean['retweeted_status_id'].value_counts()
# Let's check that there are only "None" values in 'retweeted_status_user_id' column
df_clean['retweeted_status_user_id'].value_counts()
# Let's check that there are only "None" values in 'retweeted_status_timestamp' column
df_clean['retweeted_status_timestamp'].value_counts()
# Let's drop the 5 previous columns
df_clean.drop(columns=['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], inplace=True)
# Let's check the 5 columns are gone
df_clean.info()
#10: "name", "source" and "text" columns should be more descriptive, like "dog_name", "tweet_source" and "tweet_text" respectively. Similarly, "expanded_urls" should be "expanded_url" to follow the format of the other columns.
Rename "name" column to "dog_name".
Rename "source" column to "tweet_source".
Rename "text" column to "tweet_text".
Rename "expanded_urls" column to "expanded_url".
# Let's rename 4 columns
df_clean.rename(columns={'name': 'dog_name', 'source': 'tweet_source', 'text': 'tweet_text', 'expanded_urls': 'expanded_url'}, inplace=True)
# Let's check the column names were successfully changed
df_clean.info()
#11: There are columns we will not use for Exploratory Data Analysis since they only have one value ("in_reply_to_screen_name", "is_quote_status", "screen_name"), so they should be removed.
# Let's check all values in "in_reply_to_screen_name" column are "None"
df_clean['in_reply_to_screen_name'].value_counts()
# Let's check all values in "is_quote_status" column are False
df_clean['is_quote_status'].value_counts()
# Let's check all values in "screen_name" column are "rate_dogs"
df_clean['screen_name'].value_counts()
# Let's drop those 3 columns
df_clean.drop(columns=['in_reply_to_screen_name', 'is_quote_status', 'screen_name'], inplace=True)
# Let's check whether there are other columns with one single value
df_clean['display_text_range'] = df_clean['display_text_range'].astype(str)
for col in list(df_clean):
    print(col, df_clean[col].nunique())
# Let's check all values in 'favorited' column are False
df_clean['favorited'].value_counts()
# Let's check all values in 'retweeted' column are False
df_clean['retweeted'].value_counts()
# Let's drop those 2 columns
df_clean.drop(columns=['favorited', 'retweeted'], inplace=True)
# Let's check the 5 columns were correctly removed
df_clean.info()
# Let's check the shape of our updated df_clean
df_clean.shape
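The loop above prints the number of unique values per column; the same single-value check can also be automated end to end. A minimal sketch on a hypothetical frame (the column names here are made up for illustration):

```python
import pandas as pd

# Hypothetical frame with two constant columns and one informative one
df = pd.DataFrame({'keep': [1, 2, 3],
                   'const_str': ['None', 'None', 'None'],
                   'const_bool': [False, False, False]})

# nunique(dropna=False) counts NaN as a value, so all-NaN columns also register as constant
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_cols)
```

Inspecting each candidate manually, as we did above, is still worthwhile before dropping anything automatically.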
#12: There are Tweets that do not include dogs in them; they should be removed.
Keep only the rows whose image is predicted to be a dog, by replacing the working table with a copy of the "yes_dogs" table.
# Let's first check the statistical summary of df_clean's quantitative columns
df_clean.describe()
# Let's look for Tweets' images that are predicted as dogs
# and input them to "yes_dogs" table
yes_dogs = df_clean[df_clean['p1_dog']==True]
# Let's look at the first 2 rows of "yes_dogs" table
yes_dogs.head(2)
# Let's look for Tweets' images that are not predicted as dogs
# and input them to "not_dogs" table
not_dogs = df_clean[df_clean['p1_dog']==False]
# Let's look at the first 2 rows of "not_dogs" table
not_dogs.head(2)
# Let's check the statistical summary of yes_dogs' quantitative columns
yes_dogs.describe()
# Let's check the statistical summary of not_dogs' quantitative columns
not_dogs.describe()
We can see that the "yes_dogs" table has a lower standard deviation in its "p1_conf", "rating_numerator" and "rating_denominator" columns, and a higher mean in "p1_conf", compared with the "not_dogs" table. So removing the "not_dogs" rows would improve those columns for our analysis.
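Comparing two separate describe() outputs works, but groupby can produce the side-by-side comparison in one call. A sketch with hypothetical confidence values (not our real data):

```python
import pandas as pd

# Hypothetical confidences split by the p1_dog flag
df = pd.DataFrame({'p1_dog': [True, True, False, False],
                   'p1_conf': [0.9, 0.8, 0.3, 0.6]})

# One call yields mean and spread for both groups at once
summary = df.groupby('p1_dog')['p1_conf'].agg(['mean', 'std'])
```

The resulting frame has one row per group, which makes differences in mean and standard deviation easy to scan.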
# Let's see how many rows are in "not_dogs" table
not_dogs.shape[0]
# Let's check how many unique values are in the columns of "not_dogs" table
not_dogs.nunique()
# Let's check how many rows don't have a dog name in "not_dogs" table
not_dogs.dog_name.value_counts().head()
# Let's check how many rows don't have a dog name in "yes_dogs" table
yes_dogs.dog_name.value_counts().head()
We can see that if we removed the "not_dogs" rows from "df_clean", we would also drop 175 rows without a dog name in the process, which is acceptable for our analysis.
# Let's check the "p2_dog" count in "not_dogs" table
not_dogs.p2_dog.value_counts()
# Let's check the "p2_dog" count in "yes_dogs" table
yes_dogs.p2_dog.value_counts()
We can also notice that only 141 of the 505 "not_dogs" rows (28%) have a #2 prediction of being a dog, which is also favorable.
But before dropping the "not_dogs" rows, let's check the "not_dogs" rows with low and high numerators separately.
# Let's check the "not_dog" rows with low numerator and input them to "not_dogs_low_num"
not_dogs_low_num = not_dogs[not_dogs['rating_numerator'] < 10]
# Let's check the first two rows of "not_dogs_low_num"
not_dogs_low_num.head(2)
# Let's check how many "not_dogs" have a low numerator
not_dogs_low_num.shape[0]
# Let's check the "not_dog" rows with high numerator and input them to "not_dogs_high_num"
not_dogs_high_num = not_dogs[not_dogs['rating_numerator'] >= 10]
# Let's check the first two rows of "not_dogs_high_num"
not_dogs_high_num.head(2)
# Let's check how many "not_dogs" have a high numerator
not_dogs_high_num.shape[0]
# Let's check the statistical summary of not_dogs_low_num's quantitative columns
not_dogs_low_num.describe()
# Let's check the statistical summary of not_dogs_high_num's quantitative columns
not_dogs_high_num.describe()
We can see that "not_dogs_high_num" has a higher standard deviation in "rating_numerator" than "not_dogs_low_num". Also, most "not_dogs" rows have a high numerator (317/505).
# Let's check how many rows of "not_dogs_low_num" don't have a dog name
not_dogs_low_num.dog_name.value_counts().head()
# Let's check how many rows of "not_dogs_high_num" don't have a dog name
not_dogs_high_num.dog_name.value_counts().head()
# Let's check the count of "p2_dog" column in "not_dogs_low_num"
not_dogs_low_num.p2_dog.value_counts()
# Let's check the count of "p2_dog" column in "not_dogs_high_num"
not_dogs_high_num.p2_dog.value_counts()
We noticed that both "not_dogs_low_num" and "not_dogs_high_num" have "None" values in the "dog_name" column and "True" values in "p2_dog".
So we will proceed to remove the "not_dogs" rows from our dataframe.
# Let's assign yes_dogs to a new dataframe
df = yes_dogs.copy()
# Let's reset indexes
df.reset_index(drop=True, inplace=True)
# Let's check our new dataframe
df.head()
# Let's check the "not dogs" rows are gone in df
df[df['p1_dog']==False]
# Let's check the values of "p1_dog" column are only True
df.p1_dog.value_counts()
# Let's check the shape of df. This should be 1421 rows and 24 columns
df.shape
From now on we will use df table for our analysis.
#13: After removing “not dogs” Tweets, "p2", "p2_conf", "p2_dog", "p3", "p3_conf", "p3_dog" and "p1_dog" columns will not be necessary for EDA so they should be dropped. Also, "p1" and "p1_conf" should have more descriptive names. And dog breeds in "p1" should be capitalized.
# Let's remove "p2", "p2_conf", "p2_dog", "p3", "p3_conf", "p3_dog" and "p1_dog" columns from df
df.drop(columns=["p2", "p2_conf", "p2_dog", "p3", "p3_conf", "p3_dog", "p1_dog"], inplace=True)
# Let's rename "p1" column to "dog_breed", and "p1_conf" to "prediction_confidence"
df.rename(columns={"p1":"dog_breed", "p1_conf":"prediction_confidence"}, inplace=True)
# Let's capitalize the values of the "dog_breed" column (formerly "p1")
df.dog_breed = df.dog_breed.str.title()
# Let's check the removal and rename was successful
df.info()
# Let's check the updated shape of df. This should be 1421 rows and 17 columns
df.shape
# Let's check the values from "dog_breed" were correctly capitalized
df.dog_breed.value_counts()
#14: Identifiers in the "language" column should be replaced with their corresponding language names (en: English, nl: Dutch, eu: Basque, et: Estonian), following Best Current Practice (BCP) 47, to be more descriptive.
Replace identifiers of "language" column with their corresponding language (en: English, nl: Dutch, eu: Basque, et: Estonian)
# Let's replace identifiers of "language" column with their corresponding language
df['language'] = df['language'].map({'en': 'English', 'nl': 'Dutch', 'eu': 'Basque', 'et': 'Estonian'})
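One caveat worth knowing about Series.map: any code missing from the dictionary becomes NaN, whereas Series.replace leaves unmapped values untouched. A small sketch ('und' is a made-up code that is not in our data):

```python
import pandas as pd

codes = pd.Series(['en', 'nl', 'und'])  # 'und' is a hypothetical unmapped code
mapping = {'en': 'English', 'nl': 'Dutch', 'eu': 'Basque', 'et': 'Estonian'}

mapped = codes.map(mapping)        # unmapped codes become NaN
replaced = codes.replace(mapping)  # unmapped codes are left unchanged
```

Since we verified above which identifiers actually occur, map is safe here; replace would be the defensive choice if new codes could appear.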
# Let's check the replacement was successful
df.language.value_counts()
To answer some of our questions and to gain a deeper understanding of our data, there are variables we need to create from existing columns. We will proceed with that in the next cells.
#1: Tweet Length (without url)
# Let's check some "display_text_range" values to find the pattern
df["display_text_range"].head(5)
# Let's check there is no "None" value in "display_text_range" column
df["display_text_range"].value_counts()
# Let's check there is no null value in "display_text_range" column. This should be 1421
df["display_text_range"].value_counts().sum()
# Let's extract position of last character of Tweet's text (without url) from "display_text_range" column
pattern = r"(\d+)\]$"
df['tweet_length'] = df['display_text_range'].str.extract(pat=pattern)
# Let's check the first rows of "tweet_length" column
df['tweet_length'].head(5)
# Let's check there is no "None" value in "tweet_length" column
df['tweet_length'].value_counts()
# Let's check there is no null value in "tweet_length" column. This should be 1421
df['tweet_length'].value_counts().sum()
# Let's convert the "tweet_length" column to integer
df['tweet_length'] = df['tweet_length'].astype(int)
# Let's add 1 to the "tweet_length" column
df['tweet_length'] = df['tweet_length'] + 1
# Let's check the "tweet_length" column is correctly calculated
df[['display_text_range', 'tweet_length']].head(5)
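To make the extraction step above self-explanatory, here is the same regex applied to two sample "[start, end]" strings (the values are illustrative, not from the dataset); passing expand=False makes str.extract return a Series directly:

```python
import pandas as pd

# Sample "display_text_range" strings in the "[start, end]" format seen above
ranges = pd.Series(['[0, 137]', '[0, 85]'])

# Capture the digits just before the closing bracket, then add 1
# (the end position is the index of the last character, hence the +1)
pattern = r"(\d+)\]$"
tweet_length = ranges.str.extract(pattern, expand=False).astype(int) + 1
```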
# Let's drop "display_text_range" column
df.drop(columns=["display_text_range"], inplace=True)
# Let's check the remaining columns and that there are no null values in "tweet_length" column
df.info()
# Let's check the first 5 rows of our updated table
df.head(5)
#2: Rating
Create a "rating" column with the division of "rating_numerator" and "rating_denominator".
# Let's create a "rating" column with the division of "rating_numerator" and "rating_denominator"
df['rating'] = df['rating_numerator']/df['rating_denominator']
# Let's check the new list of columns
df.info()
# Let's check the "rating" column was correctly calculated
df[['rating_numerator','rating_denominator','rating']].head(5)
# Let's count the values in "rating" column
df.rating.value_counts()
# Let's check there is no null value in "rating" column. This should be 1421
df.rating.value_counts().sum()
#3: Engagement (retweet count + favorite count)
Create "engagement" column by adding "retweet_count" and "favorite_count" columns.
# Let's create "engagement" column by adding "retweet_count" and "favorite_count" columns
df["engagement"] = df["retweet_count"] + df["favorite_count"]
# Let's check our updated columns and there is no null value in "engagement" column
df.info()
# Let's check "engagement" column was correctly calculated
df[["retweet_count", "favorite_count", "engagement"]].head(5)
#4: Day of the week
Create a "day_of_week" column using "timestamp" column and pandas.Series.dt.day_name.
# Let's create a "day_of_week" column using "timestamp" column
df['day_of_week'] = df['timestamp'].dt.day_name()
# Let's check our updated columns and there is no null value in "day_of_week" column
df.info()
# Let's check "day_of_week" column was correctly created
df[['timestamp', 'day_of_week']].head(5)
#5: Weekend/Weekday
Create a column named "weekend_weekday" based on the value of "day_of_week" column: if it is "Saturday" or "Sunday", it is considered a "weekend", else it's a "weekday".
# Let's create a column named "weekend_weekday" based on the value of "day_of_week" column
df['weekend_weekday'] = df['day_of_week'].apply(lambda x: "weekend" if x == "Saturday" or x == "Sunday" else "weekday")
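An equivalent, fully vectorised alternative to the lambda is isin combined with np.where; a sketch on a few sample day names:

```python
import numpy as np
import pandas as pd

days = pd.Series(['Monday', 'Saturday', 'Sunday', 'Friday'])

# Boolean membership test, then a vectorised two-way choice
labels = np.where(days.isin(['Saturday', 'Sunday']), 'weekend', 'weekday')
```

Both forms give the same labels; the vectorised version simply avoids a Python-level function call per row.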
# Let's check our updated columns and there is no null value in "weekend_weekday" column
df.info()
# Let's check "weekend_weekday" was correctly calculated
df[['day_of_week', 'weekend_weekday']].sample(5)
df.to_csv('twitter_archive_master.csv', index=False)
# Let's first quickly review the histogram of all the numerical variables in the dataset
df.hist(figsize=(10,8));
# Let's check how many unique values are in the "favorite_count" column
df.favorite_count.nunique()
Even though the "favorite_count" column is an integer type, since it has so many unique values, we will use a histogram instead of a bar chart to analyze its distribution.
# Let's define the function for Histograms of Quantitative Continuous Values
def func_hist(x, var_name, xlabel, ylabel):
    fig, ax = plt.subplots(figsize=(15,8))
    plt.hist(x, bins='auto')
    plt.title('Histogram of {}'.format(var_name), fontweight="bold")
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    ax.set_xlim(0,)
    ax.set_ylim(0,)
    return plt.show()
It's important to note that we chose the "auto" method to select the number of histogram bins. This method chooses the maximum number of bins between the "Sturges" and "Freedman-Diaconis" estimators.
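We can verify that rule directly with NumPy's histogram_bin_edges: the 'auto' bin count equals the larger of the 'sturges' and 'fd' counts. A sketch on synthetic normal data (not our dataset):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(size=1000)

# Bin edges under the three estimators; the bin count is len(edges) - 1
edges_auto = np.histogram_bin_edges(data, bins='auto')
n_sturges = len(np.histogram_bin_edges(data, bins='sturges')) - 1
n_fd = len(np.histogram_bin_edges(data, bins='fd')) - 1
```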
# Let's see the histogram in detail for "favorite_count" column
func_hist(df['favorite_count'], 'Favorite Count of WeRateDogs Tweets', 'Favorite Count', 'Amount of Tweets')
We can see that the distribution of the Favorite Count of WeRateDogs Tweets is skewed to the right. Most of the rating Tweets in our clean dataset have a relatively low favorite count (less than 40,000), while only a small number of Tweets exceed that mark. These higher-favorite-count Tweets can even surpass 100,000 favorites.
# Let's define the function for Boxplot of Quantitative Continuous Values
def func_box(X, var_name, xlabel, decimal=0):
    fig, ax = plt.subplots(figsize=(16,6))
    bp = plt.boxplot(X, vert=False, showmeans=True)
    plt.title('Boxplot of {}'.format(var_name), fontweight="bold")
    plt.xlabel(xlabel)
    plt.yticks([1], ["Boxplot"], rotation=90, verticalalignment="center")
    ax.grid(b=True, axis='x')
    # Adding mean label
    if abs(X.mean() - X.median()) < 0.1:
        ax.text(X.mean(), 1.02, '%.2f' % (X.mean()), color='green')
    elif abs(X.mean() - X.median()) < 1:
        ax.text(X.mean(), 1.02, '%.1f' % (X.mean()), color='green')
    else:
        ax.text(X.mean(), 1.02, '%.0f' % (X.mean()), color='green')
    # Adding median label
    x, y = bp['medians'][0].get_xydata()[1]
    if abs(X.mean() - X.median()) < 0.1:
        ax.text(x, y+0.01, '%.2f' % x, horizontalalignment='center')
    elif abs(X.mean() - X.median()) < 1:
        ax.text(x, y+0.01, '%.1f' % x, horizontalalignment='center')
    else:
        ax.text(x, y+0.01, '%.0f' % x, horizontalalignment='center')
    if decimal == 0:
        # Adding quartile labels
        x, y = bp['boxes'][0].get_xydata()[0]
        ax.text(x, y-0.01, '%.0f' % x, horizontalalignment='center', verticalalignment='top')
        x, y = bp['boxes'][0].get_xydata()[3]
        ax.text(x, y-0.01, '%.0f' % x, horizontalalignment='center', verticalalignment='top')
        # Adding whisker labels
        x, y = bp['caps'][0].get_xydata()[0]
        ax.text(x, y-0.01, '%.0f' % x, horizontalalignment='center', verticalalignment='top')
        x, y = bp['whiskers'][1].get_xydata()[1]
        ax.text(x, y-0.05, '%.0f' % x, horizontalalignment='center', verticalalignment='top')
    else:
        # Adding quartile labels
        x, y = bp['boxes'][0].get_xydata()[0]
        ax.text(x, y-0.01, '%.1f' % x, horizontalalignment='center', verticalalignment='top')
        x, y = bp['boxes'][0].get_xydata()[3]
        ax.text(x, y-0.01, '%.1f' % x, horizontalalignment='center', verticalalignment='top')
        # Adding whisker labels
        x, y = bp['caps'][0].get_xydata()[0]
        ax.text(x, y-0.01, '%.1f' % x, horizontalalignment='center', verticalalignment='top')
        x, y = bp['whiskers'][1].get_xydata()[1]
        ax.text(x, y-0.05, '%.1f' % x, horizontalalignment='center', verticalalignment='top')
    # Setting xlim
    if X.min() > 0.01:
        if X.max() - X.min() < 2000 * X.min():
            ax.set_xlim(0,)
    return plt.show()
# Let's look at the statistical summary of "favorite_count" column
df.favorite_count.describe()
We can see above that the highest favorite count a WeRateDogs Tweet has ever received, in our clean dataset, is 155,733.
# Let's see the boxplot of "favorite_count" column
func_box(df['favorite_count'], 'Favorite Count of WeRateDogs Tweets', 'Favorite Count')
If we look at the boxplot of Favorite Count of WeRateDogs Tweets, we can see that the majority of Tweets in our clean dataset (between 25% and 75% percentile) had a favorite count between 2,044 and 10,883.
Also, it's worth noticing that since we have a great number of outliers with high values, the mean (8,713) is much greater than the median (4,130). We should take this into consideration when deciding whether to use the median or the mean for our analysis.
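This mean-above-median pattern is the generic signature of right-skewed data; a tiny illustration with made-up numbers:

```python
import pandas as pd

# A small right-skewed sample: one extreme value drags the mean above the median
s = pd.Series([1, 2, 2, 3, 3, 4, 100])
skew_gap = s.mean() - s.median()  # positive for right-skewed data
```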
# Let's check how many unique values are in the "retweet_count" column
df.retweet_count.nunique()
Even though the "retweet_count" column is an integer type, since it has so many unique values, we will use a histogram instead of a bar chart to analyze its distribution.
# Let's see the histogram in detail for "retweet_count" column
func_hist(df['retweet_count'], 'Retweet Count of WeRateDogs Tweets', 'Retweet Count', 'Amount of Tweets')
We can see that the distribution of the Retweet Count of WeRateDogs Tweets is skewed to the right. Most of the rating Tweets in our clean dataset have a relatively low retweet count (less than 20,000), while only a small number of Tweets have a retweet count over 20,000. These higher-retweet-count Tweets can even surpass 60,000 retweets.
# Let's look at the statistical summary of "retweet_count" column
df.retweet_count.describe()
We can see above that the highest retweet count a WeRateDogs Tweet has ever received, in our clean dataset, is 77,599.
# Let's see the boxplot of "retweet_count" column
func_box(df['retweet_count'], 'Retweet Count of WeRateDogs Tweets', 'Retweet Count')
If we look at the boxplot of Retweet Count of WeRateDogs Tweets, we can see that the majority of Tweets in our clean dataset (between 25% and 75% percentile) had a retweet count between 592 and 2,881.
Also, it's worth noticing that since we have a great number of outliers with high values, the mean (2,503) is much greater than the median (1,278). We should take this into consideration when deciding whether to use the median or the mean for our analysis.
Since "rating_numerator" column is a float type, let's plot a histogram to analyze its distribution.
# Let's see the histogram in detail for "rating_numerator" column
func_hist(df['rating_numerator'], 'Rating\'s Numerator of WeRateDogs Tweets', 'Rating\'s Numerator', 'Amount of Tweets')
We can see that the distribution of the Rating's Numerator of WeRateDogs Tweets is skewed to the right. Most of the rating Tweets in our clean dataset have a lower numerator (less than 20), while only a small number of Tweets have a numerator over 20. These higher-numerator Tweets can even exceed 160.
# Let's look at the statistical summary of "rating_numerator" column
df.rating_numerator.describe()
We can see above that the highest Rating's Numerator a WeRateDogs Tweet has ever given, in our clean dataset, is 165.
# Let's see the boxplot of "rating_numerator" column
func_box(df['rating_numerator'], 'Rating\'s Numerator of WeRateDogs Tweets', 'Rating\'s Numerator')
If we look at the boxplot of Rating's Numerator of WeRateDogs Tweets, we can see that the majority of rating Tweets in our clean dataset (between 25% and 75% percentile) had a numerator between 10 and 12.
Also, it's worth noticing that since we have some outliers with high values, the mean (11.4) is greater than the median (11). We should take this into consideration when deciding to use median or mean for our analysis.
# Let's do value counts of "rating_denominator" column
df.rating_denominator.value_counts()
Even though the "rating_denominator" column is a float type, we will use a bar chart to analyze its distribution given the few unique values it has.
# Let's define the function for bar chart
def func_bar(X, title, xlabel, ylabel):
    fig, ax = plt.subplots(figsize=(15,6))
    X.plot.bar(color='lightblue')
    plt.title(title, fontweight="bold")
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    ax.set_xlabel(xlabel, labelpad=15)
    ax.set_ylabel(ylabel, labelpad=15)
    plt.xticks(rotation=0)
    return plt.show()
# Let's see the bar chart of "rating_denominator" column
ej1 = df['rating_denominator'].value_counts().sort_index()
func_bar(ej1, 'Bar Chart of Rating\'s Denominator of WeRateDogs Tweets', 'Rating\'s Denominator', 'Amount of Tweets')
We can see that the distribution of the Rating's Denominator of WeRateDogs Tweets is skewed to the right. Most of the rating Tweets in our clean dataset have a denominator equal to 10, while only a small number of Tweets have a denominator over 10. These higher-denominator Tweets can even exceed 100.
# Let's look at the statistical summary of "rating_denominator" column
df.rating_denominator.describe()
We can see above that the highest Rating's Denominator a WeRateDogs Tweet has ever given, in our clean dataset, is 150.
# Let's see the boxplot of "rating_denominator" column
func_box(df['rating_denominator'], 'Rating\'s Denominator of WeRateDogs Tweets', 'Rating\'s Denominator')
If we look at the boxplot of Rating's Denominator of WeRateDogs Tweets, we can see that the majority of rating Tweets in our clean dataset (between 25% and 75% percentile) had a denominator of 10.
Also, it's worth noticing that since we have some outliers with high values, the mean (10.4) is greater than the median (10). We should take this into consideration when deciding to use median or mean for our analysis.
Since "rating" column is a float type, let's plot a histogram to analyze its distribution.
# Let's see the histogram in detail for "rating" column
func_hist(df['rating'], 'Ratings of WeRateDogs Tweets', 'Rating', 'Amount of Tweets')
We can see that the distribution of the Ratings of WeRateDogs Tweets is skewed to the left. Most of the Tweets in our clean dataset have a high rating value (more than 0.8), while only a small number of Tweets have a rating below 0.8.
# Let's look at the statistical summary of "rating" column
df.rating.describe()
We can see above that the highest Rating a WeRateDogs Tweet has ever given, in our clean dataset, is 1.4.
# Let's see the boxplot of "rating" column
func_box(df['rating'], 'Ratings of WeRateDogs Tweets', 'Rating', decimal=1)
If we look at the boxplot of Ratings of WeRateDogs Tweets, we can see that the majority of Tweets in our clean dataset (between 25% and 75% percentile) had a rating between 1 and 1.2.
Also, it's worth noticing that since we have some outliers with low values, the mean (1.09) is a little bit smaller than the median (1.10).
# Let's do value counts of "tweet_length" column
df['tweet_length'].value_counts()
Even though the "tweet_length" column is an integer type, since it has many unique values, we decided to use a histogram to analyze its distribution.
# Let's see the histogram in detail for "tweet_length" column
func_hist(df['tweet_length'], 'Tweet\'s Text Length (without URL) of WeRateDogs Tweets', 'Tweet\'s Text Length', 'Amount of Tweets')
We can see that the distribution of the Tweet's Text Length (without URL) of WeRateDogs is skewed to the left. Most of the rating Tweets in our clean dataset have a longer text (more than 80 characters), while only a small number of Tweets have a text length below 80.
# Let's look at the statistical summary of "tweet_length" column
df.tweet_length.describe()
We can see above that the average WeRateDogs Tweet's Text Length, in our clean dataset, is 111.29.
# Let's see the boxplot of "tweet_length" column
func_box(df['tweet_length'], 'Tweet\'s Text Length (without URL) of WeRateDogs Tweets', 'Tweet\'s Text Length (without URL)', decimal=1)
If we look at the boxplot of Tweet's Text Length (without URL) of WeRateDogs, we can see that the majority of Tweets in our clean dataset (between 25% and 75% percentile) had a Text Length between 94 and 137.
Also, it's worth noticing that since we have more Tweets with short Text Length, the mean (111) is smaller than the median (116).
Since "img_num" column is an integer type, let's plot a bar chart to analyze its distribution.
# Let's see the bar chart of "img_num" column
ej2 = df['img_num'].value_counts().sort_index()
func_bar(ej2, 'Bar Chart of WeRateDogs Tweet\'s image number used for prediction', 'Tweet\'s image number', 'Amount of Tweets')
We can see that the distribution of "WeRateDogs Tweet's image number used for prediction" is skewed to the right. Most of the rating Tweets in our clean dataset used the first image for the breed prediction, while only a small number of Tweets used the other images.
# Let's look at the statistical summary of "img_num" column
df.img_num.describe()
# Let's see the boxplot of "img_num" column
func_box(df['img_num'], 'WeRateDogs Tweet\'s image number used for prediction', 'Tweet\'s image number')
If we look at the boxplot of "WeRateDogs Tweet's image number used for prediction", we can see that the majority of rating Tweets in our clean dataset (between 25% and 75% percentile) used the first image for the breed prediction.
Also, it's worth noticing that since we have outliers with high values, the mean (1.2) is greater than the median (1).
Since "prediction_confidence" column is a float type, let's plot a histogram to analyze its distribution.
# Let's see the histogram in detail for "prediction_confidence" column
func_hist(df['prediction_confidence']*100, 'Algorithm\'s Breed #1 Prediction Confidence of WeRateDogs Tweets (in percentage)', 'Algorithm\'s Breed Prediction Confidence', 'Amount of Tweets')
We can see that the distribution of "Algorithm's Breed #1 Prediction Confidence of WeRateDogs Tweets" is skewed to the left. Most of the Algorithm's Breed #1 Predictions in our clean dataset have a high prediction confidence (more than 40%), while only a small number of predictions have a confidence below 40%.
# Let's look at the statistical summary of "prediction_confidence" column
df.prediction_confidence.describe()
We can see above that the highest "Algorithm's Breed Prediction Confidence of WeRateDogs Tweets", in our clean dataset, is 99.9956%.
# Let's see the boxplot of "prediction_confidence" column
func_box(df['prediction_confidence']*100, 'Algorithm\'s Breed #1 Prediction Confidence of WeRateDogs Tweets (in percentage)', 'Algorithm\'s Breed Prediction Confidence', decimal=1)
If we look at the boxplot of "Algorithm's Breed #1 Prediction Confidence of WeRateDogs Tweets", we can see that the majority of Algorithm's Breed #1 Predictions of WeRateDogs Tweets in our clean dataset (between 25% and 75% percentile) had a confidence between 39% and 85%.
Also, it's worth noticing that since we have more Tweets with smaller prediction confidence, the mean (61.4%) is a little bit smaller than the median (61.5%).
Since "dog_stage" column is a category type, let's use a bar chart to analyze its distribution.
# Let's define a new function for Bar Chart with mean value displayed, ticks sorted and labels in each bar
def func_bar1(X, Y, title, xlabel, ylabel, text_x, text_y, word_length, mean, meanlabel):
    fig, ax = plt.subplots(figsize=(15,7))
    plt.bar(np.arange(len(X)), Y, color='lightblue')
    if len(X) >= 10:
        plt.xticks(np.arange(len(X)), X.str[:word_length]+'\n'+X.str[word_length:], rotation=45, fontsize=10)
    else:
        plt.xticks(np.arange(len(X)), X.str[:word_length]+'\n'+X.str[word_length:], rotation=0)
    plt.title(title, fontweight="bold")
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    ax.set_xlabel(xlabel, labelpad=10)
    ax.set_ylabel(ylabel, labelpad=15)
    for i, v in enumerate(Y):
        if Y.dtypes == "int64":
            ax.text(i - text_x, v + text_y, str(v), color='blue')
        else:
            ax.text(i - text_x, v + text_y, '{:,.2f}'.format(v), color='blue')
    if mean is not None and meanlabel is not None:
        plt.hlines(mean, xmin=-0.4, xmax=len(X)-0.6, linestyles='dashed', label=meanlabel, color='grey')
        ax.text(len(X) - 0.6 - text_x, mean + text_y*2, '{:,.2f}'.format(mean), color='black')
        plt.legend()
    return plt.show()
# Let's plot our Bar Chart of "dog_stage"
ej3 = df['dog_stage'].astype(str).value_counts()
x = pd.DataFrame(ej3.index.values, columns=['x'])
y = pd.DataFrame(ej3.values, columns=['y'])
func_bar1(x['x'], y['y'], 'Bar Chart of WeRateDogs Tweet\'s dog stages', 'Tweet\'s dog stage', 'Amount of Tweets', 0.1, 10, 13, y.y.mean(), 'Average of Amount of Tweets')
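For reference, value_counts returns the labels in .index and the counts in .values, already sorted by count, which is why we can split them into the x and y frames above. A sketch with made-up stage values:

```python
import pandas as pd

# Hypothetical dog-stage values mirroring the real column's format
stages = pd.Series(['None', 'pupper', 'None', 'doggo', 'None'])

# Labels live in .index, counts in .values, ordered from most to least frequent
counts = stages.value_counts()
```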
# Let's get the number of Tweets that did not have a dog stage
no_ds_tweets = df[df['dog_stage'] == "None"]['tweet_id'].count()
# Let's see the percentage of Tweets that did not have a dog stage
no_ds_tweets / df.shape[0] * 100
# Let's get the number of Tweets that have "pupper" as a dog stage
pupper_tweets = df[df['dog_stage'] == "pupper"]['tweet_id'].count()
# Let's see the percentage of Tweets that have "pupper" as a dog stage
pupper_tweets / df.shape[0] * 100
This shows that most of the rating Tweets of WeRateDogs in our clean dataset do not have a dog stage mentioned in them (83%). Among the rating Tweets that do have a dog stage, the most prominent one is pupper with 155 Tweets (11%).
# Let's check the number of unique values in "dog_breed" column
df.dog_breed.nunique()
Since the "dog_breed" column is a string type with too many levels, let's use a bar chart to plot the top 10 breeds.
# Let's calculate the general dog breed's average mentions
mean = df['dog_breed'].value_counts().mean()
mean
# Let's plot our Bar Chart of "dog_breed"
ej4 = df['dog_breed'].value_counts().head(10)
x = pd.DataFrame(ej4.index.values, columns=['x'])
y = pd.DataFrame(ej4.values, columns=['y'])
title = 'Bar Chart of Top 10 dog breeds (according to the Algorithm\'s #1 prediction)\n in terms of most frequent mentions'
func_bar1(x['x'], y['y'], title, 'Tweet\'s dog breed', 'Amount of Tweets', 0.1, 1, 10, mean, 'Average of Amount of Tweets')
# Let's check how many more Tweets are predicted as 'Golden Retriever' than the general mean
gr_tweets = df[df['dog_breed'] == "Golden_Retriever"]['tweet_id'].count()
diff_tweets = (gr_tweets - mean)/mean
print(diff_tweets)
# Let's check the percentage of Tweets that were predicted as 'Golden Retriever'
gr_tweets / df.shape[0] * 100
In our clean dataset, the dog breed most frequently included in WeRateDogs' rating Tweets, according to the algorithm's #1 prediction, was Golden Retriever, with 133 Tweets (9%). According to the algorithm, this breed was mentioned about 9 times more than the average breed.
# Let's check the unique values of "dog_name" column
df.dog_name.nunique()
Since the "dog_name" column is a string with too many levels, let's use a bar chart to plot the top 10 names.
# Let's calculate the general dog name's average tweet count
mean = df['dog_name'].value_counts().mean()
mean
# Let's plot our Bar Chart of "dog_name"
ej5 = df['dog_name'].value_counts().head(10)
x = pd.DataFrame(ej5.index.values, columns=['x'])
y = pd.DataFrame(ej5.values, columns=['y'])
title = 'Bar Chart of WeRateDogs Tweet\'s Top 10 Dog Names (in terms of amount of times mentioned)'
func_bar1(x['x'], y['y'], title, 'Tweet\'s dog name', 'Amount of Tweets', 0.1, 1, 10, mean, 'Average of Amount of Tweets')
# Let's check how many more Tweets have "Cooper" as dog name than the general mean
cooper_tweets = df[df['dog_name'] == "Cooper"]['tweet_id'].count()
diff_tweets = (cooper_tweets - mean)/mean
print(diff_tweets)
# Let's check the percentage of Tweets that did not have a dog name
df[df['dog_name'] == "None"]['tweet_id'].count() / df.shape[0] * 100
# Let's check the percentage of Tweets mentioned Cooper
cooper_tweets/df.shape[0]*100
In our clean dataset, the most frequent value of "dog_name" in WeRateDogs' rating Tweets is no name at all ("None", 25% of Tweets). Among actual names, "Cooper" is the most frequently included, with 11 Tweets (0.77%); this name was mentioned about 4 times more than the average name.
Since the "tweet_source" column is a string type, let's plot a bar chart to analyze its distribution.
# Let's plot our Bar Chart of "tweet_source"
ej6 = df['tweet_source'].value_counts()
x = pd.DataFrame(ej6.index.values, columns=['x'])
y = pd.DataFrame(ej6.values, columns=['y'])
func_bar1(x['x'], y['y'], 'Bar Chart of WeRateDogs Tweet\'s source', 'Tweet\'s source', 'Amount of Tweets', 0.05, 5, 18, y.y.mean(), 'Average of Amount of Tweets')
# Let's check how many more Tweets come from iPhone than the general mean
iphone_tweets = df[df['tweet_source'] == "Twitter for iPhone"]['tweet_id'].count()
aveg_tweets = y.y.mean()
diff_tweets = ((iphone_tweets - aveg_tweets)/aveg_tweets)
print(diff_tweets)
# Let's check how many Tweets come from iPhone in percentage
iphone_tweets/df.shape[0]*100
In our clean dataset, most of WeRateDogs' rating Tweets were posted from Twitter for iPhone: 1398 Tweets (98%), almost 2 times more than the average per source.
Since the "language" column is a string type, let's plot a bar chart to analyze its distribution.
# Let's plot our Bar Chart of "language" column
ej7 = df['language'].value_counts()
x = pd.DataFrame(ej7.index.values, columns=['x'])
y = pd.DataFrame(ej7.values, columns=['y'])
func_bar1(x['x'], y['y'], 'Bar Chart of WeRateDogs Tweet\'s Languages', 'Tweet\'s Language', 'Amount of Tweets', 0.05, 5, 18, y.y.mean(), 'Average of Amount of Tweets')
# Let's check how many more Tweets are in English than the general mean
en_tweets = df[df['language'] == "English"]['tweet_id'].count()
aveg_tweets = y.y.mean()
diff_tweets = ((en_tweets - aveg_tweets)/aveg_tweets)
print(diff_tweets)
# Let's check the percentage of Tweets in English
en_tweets/df.shape[0]*100
In our clean dataset, most of WeRateDogs' rating Tweets are in English: 1416 Tweets (99.6%), almost 3 times more than the average per language.
# Let's check the retweet count per "timestamp" column
df.sort_values(by=['timestamp'])[['timestamp','retweet_count']]
Since plotting "retweet_count" against each individual "timestamp" would be too cramped in a bar chart, let's group by day and take the average "retweet_count" per day, to check if it makes a better graph.
# Let's group by day and get the average of "retweet_count" per day
table = df.set_index('timestamp').resample("D")['retweet_count'].mean()
table
Since "retweet_count" per day would still be too cramped in our bar chart, let's group by week and take the average "retweet_count" per week, to check if it makes a better graph.
# Let's group by week and get the average of "retweet_count" per week
table = df.set_index('timestamp').resample("W")['retweet_count'].mean()
table
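As an aside, here is what `resample("W").mean()` does, on an invented two-week daily series (dates and values are made up):

```python
import pandas as pd

# Toy daily retweet counts: Monday 2016-01-04 through Sunday 2016-01-17
idx = pd.date_range('2016-01-04', periods=14, freq='D')
daily = pd.Series(range(14), index=idx)

# "W" buckets end on Sundays by default; mean() averages each week's values
weekly = daily.resample('W').mean()
print(weekly)
```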
This table looks better, so let's plot our bar chart to analyze data.
# Let's define the function for bar chart with datetime
def func_bar2(X, title, xlabel, ylabel, freq="weekly"):
    fig, ax = plt.subplots(figsize=(15, 6))
    X.plot.bar(color='lightblue')
    plt.title(title, fontweight="bold")
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    # Tick labels: full dates for weekly data, year-month for monthly data
    if freq == "weekly":
        ticklabels = X.index.strftime('%Y-%m-%d')
    else:
        ticklabels = X.index.strftime('%Y-%m')
    ax.xaxis.set_major_formatter(ticker.FixedFormatter(ticklabels))
    plt.gca().yaxis.set_major_formatter(StrMethodFormatter('{x:,.2f}'))
    ax.set_xlabel(xlabel, labelpad=15)
    ax.set_ylabel(ylabel, labelpad=15)
    plt.xticks(fontsize=10)
    plt.show()
# Let's plot our bar chart
func_bar2(table, 'Bar Chart of WeRateDogs Tweets\' Average Retweet Count by Week', 'Week', 'Average Retweet Count by Week')
In the bar chart above we can see that WeRateDogs Tweets' average weekly retweet count grew over time, from 181 in the second week of November 2015 to 5,848 in the second week of July 2017.
Let's repeat the procedure with "favorite_count" column.
# Let's group by week and get the average of "favorite_count" per week
table = df.set_index('timestamp').resample("W")['favorite_count'].mean()
table
# Let's plot our bar chart
func_bar2(table, 'Bar Chart of WeRateDogs Tweets\' Average Favorite Count by Week', 'Week', 'Average Favorite Count by Week')
In the bar chart above we can see that WeRateDogs Tweets' average weekly favorite count grew over time, from 880 in the second week of November 2015 to 28,798 in the second week of July 2017.
# Let's check the older date a Tweet was posted in df
df.timestamp.min()
# Let's check the most recent date a Tweet was posted in df
df.timestamp.max()
Since WeRateDogs' Tweets do not cover all of November 2015 or August 2017, let's only analyze full months (December 2015 - July 2017) for this question.
# Let's only get Tweets between December 2015 - July 2017
df_months = df[(df['timestamp'] >= '2015-12-01') & (df['timestamp'] < '2017-08-01')]  # "< '2017-08-01'" keeps all of July 31
# Let's calculate how many ratings Tweets were posted per month
table = df_months.set_index('timestamp').resample("M")['tweet_id'].count()
table
# Let's plot our bar chart
func_bar2(table, 'Bar Chart of WeRateDogs\' Tweet Count by Month', 'Month', 'Tweet Count by Month', freq="monthly")
print(table.min())
print(table.max())
july_mtc = table.loc['2017-07-31']
# Let's check the decrease in Monthly Tweet Count in percentage
(july_mtc - table.max()) / table.max() * 100
In the bar chart above we can see that WeRateDogs' monthly Tweet count declined over time, from 234 in December 2015 to 43 in July 2017 (82% fewer).
# Let's get the monthly "retweet_count"
table1 = df_months.set_index('timestamp').resample("M")['retweet_count'].mean()
table1
# Let's plot our bar chart
func_bar2(table1, 'Bar Chart of WeRateDogs Tweets\' Retweet Count by Month', 'Month', 'Retweet Count by Month', freq="monthly")
print(table1.min())
print(table1.max())
# Let's calculate the difference in monthly retweet counts
(table1.max() - table1.min())/ table1.min()
In the bar chart above we can see that WeRateDogs Tweets' average monthly retweet count grew over time, from 1,159 in December 2015 to 6,135 in July 2017 (about 4 times higher).
# Let's get the monthly "favorite_count"
table2 = df_months.set_index('timestamp').resample("M")['favorite_count'].mean()
table2
# Let's plot our bar chart
func_bar2(table2, 'Bar Chart of WeRateDogs Tweets\' Favorite Count by Month', 'Month', 'Favorite Count by Month', freq="monthly")
print(table2.min())
print(table2.max())
# Let's calculate the difference in monthly favorite counts
(table2.max() - table2.min())/ table2.min()
In the bar chart above we can see that WeRateDogs Tweets' average monthly favorite count grew over time, from 2,883 in December 2015 to 29,820 in July 2017 (about 9 times higher).
# Let's get the monthly "engagement" by "dog_stage"
table3 = df_months.set_index('timestamp').groupby('dog_stage').resample("M")['engagement'].mean()
table3 = table3.reset_index()
table3['month'] = pd.to_datetime(table3['timestamp']).dt.to_period('m')
table3.drop(columns=['timestamp'],inplace=True)
table3
# Let's reshape our table3 for our plot
table4 = table3.pivot(index='month', columns='dog_stage', values='engagement')
table4 = table4.fillna(0)
table4.head(5)
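For reference, this is what `pivot` does to the long-format table above; a tiny sketch with invented numbers:

```python
import pandas as pd

# Long format: one row per (month, dog_stage) pair (values invented)
long_df = pd.DataFrame({
    'month': ['2016-01', '2016-01', '2016-02'],
    'dog_stage': ['pupper', 'doggo', 'pupper'],
    'engagement': [100, 50, 120],
})

# Wide format: one row per month, one column per stage; absent pairs become 0
wide = long_df.pivot(index='month', columns='dog_stage', values='engagement').fillna(0)
print(wide)
```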
# Let's do our plot
table4.plot(figsize=(15, 9), title='Evolution of Dog Stage\'s Average Monthly Engagement', cmap='tab20c')
plt.ylabel("Average Monthly Engagement")
plt.gca().yaxis.set_major_formatter(StrMethodFormatter('{x:,.2f}'))
plt.xlabel("Month");
This graph is cluttered given the number of dog stages displayed, but we can spot that the "pupper" stage leads in engagement for the majority of the months.
# Let's do a plot showcasing "pupper" stage
fig, ax = plt.subplots(figsize=(15, 8))
# Multiple line plot: draw every stage in light grey first
for column in table4:
    plt.plot(table4.index.strftime('%Y-%m').values, table4[column].values, marker='', color='grey', linewidth=1, alpha=0.4)
# Now redraw the "pupper" curve, bigger and with a distinct color
plt.plot(table4.index.strftime('%Y-%m').values, table4['pupper'].values, marker='', color='orange', linewidth=4, alpha=0.7)
# Annotate the plot
import math
num = 0
heights = []
for i in table4.values[-1].tolist():
    name = list(table4)[num]
    # Avoid overlapping stage labels by only showing the first stage at each rounded height
    if int(math.ceil(i / 10000.0)) * 10000 not in heights:
        heights.append(int(math.ceil(i / 10000.0)) * 10000)
        if name != 'pupper':
            ax.text(len(table4.index.strftime('%Y-%m').values) - 0.8, i, name, horizontalalignment='left', size='small', color='grey')
    num += 1
# Add a special annotation for the 'pupper' stage
ax.text(len(table4.index.strftime('%Y-%m').values) - 0.8, list(table4['pupper'].values)[-1] + 2800, 'pupper', horizontalalignment='left', color='orange')
# Add titles and labels
plt.title("Evolution of Pupper stage vs other dog stages")
plt.xlabel("Month", labelpad=10)
plt.ylabel("Average Monthly Engagement", labelpad=5)
# Setting x-axis limits
ax.set_xlim(0, len(table4.index) + 0.75)
# Format the ticks
plt.xticks(rotation=90)
plt.gca().yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
plt.show()
In the line plot above we can see that, in our clean dataset, the "pupper" stage received the highest average monthly engagement (retweet count plus favorite count) of all the dog stages mentioned in WeRateDogs rating Tweets throughout the months.
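As a reminder, the "engagement" column used in this section is simply the per-Tweet sum of retweet and favorite counts; a minimal sketch with invented numbers:

```python
import pandas as pd

# Toy frame with the two count columns (values invented)
toy = pd.DataFrame({'retweet_count': [10, 40], 'favorite_count': [90, 160]})

# Engagement as defined in this notebook: retweet count plus favorite count
toy['engagement'] = toy['retweet_count'] + toy['favorite_count']
print(toy)
```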
From question #9 in Univariate Analysis, we can expect Tweets with a "None" dog stage to be the most predominant (highest amount of Tweets) through the months, but let's confirm it with a multi-line plot.
# Let's get the monthly Tweet count by "dog_stage"
table3 = df_months.set_index('timestamp').groupby('dog_stage').resample("M")['tweet_id'].count()
table3 = table3.reset_index()
table3['month'] = pd.to_datetime(table3['timestamp']).dt.to_period('m')
table3.drop(columns=['timestamp'],inplace=True)
table3
# Let's reshape our table3 for our plot
table4 = table3.pivot(index='month', columns='dog_stage', values='tweet_id')
table4 = table4.fillna(0)
table4.head(5)
# Let's do our plot
table4.plot(figsize=(15, 9), title='Evolution of Dog Stage\'s Monthly Tweet Count', cmap='tab20c')
plt.ylabel("Monthly Tweet Count")
plt.gca().yaxis.set_major_formatter(StrMethodFormatter('{x:,.2f}'))
plt.xlabel("Month");
As expected, in our clean dataset, Tweets with a "None" dog stage were the most common (highest Tweet count) throughout the months, as seen in the line plot above.
# Let's check how many unique values are in "dog_breed" column
df.dog_breed.nunique()
Since "dog_breed" column has too many unique levels, let's just display the Top 10 dog breeds in terms of highest average retweet count.
# Let's get the dog breed's general average retweet count
avg_rt_db = df.groupby('dog_breed').retweet_count.mean().mean()
avg_rt_db
# Let's get the Top 10 dog breeds in terms of highest average retweet count
top10_breed_rt = df.groupby('dog_breed').retweet_count.mean().sort_values(ascending=False).head(10).reset_index()
top10_breed_rt
# Let's plot our bar chart
x = top10_breed_rt.dog_breed
y = top10_breed_rt.retweet_count
title = 'Bar Chart of Top 10 Dog Breeds (according to Algorithm\'s #1 prediction)\n in terms of Highest Average Retweet Count'
func_bar1(x, y, title, 'Dog Breed', 'Average Retweet Count', 0.25, 45, 12, avg_rt_db, 'General Average Retweet Count')
# Let's get the "Standard_Poodle" average retweet count
sp_avg_rc = top10_breed_rt[top10_breed_rt['dog_breed']=="Standard_Poodle"]['retweet_count']
# Let's calculate how much higher is its average compared to the general mean
(sp_avg_rc - avg_rt_db) / avg_rt_db * 100
In the Bar Chart above we can see that, within the WeRateDogs rating Tweets, Standard Poodle (according to the Algorithm's #1 prediction) is the dog breed with the highest average retweet count, 176% higher than the general mean.
Since "dog_breed" column has too many unique levels, let's just display the Top 10 dog breeds in terms of highest average favorite count.
# Let's get the dog breed's general average favorite count
avg_fc_db = df.groupby('dog_breed').favorite_count.mean().mean()
avg_fc_db
# Let's get the Top 10 dog breeds in terms of highest average favorite count
top10_breed_fc = df.groupby('dog_breed').favorite_count.mean().sort_values(ascending=False).head(10).reset_index()
top10_breed_fc
# Let's plot our bar chart
x = top10_breed_fc.dog_breed
y = top10_breed_fc.favorite_count
title = 'Bar Chart of Top 10 Dog Breeds (according to Algorithm\'s #1 prediction)\n in terms of Highest Average Favorite Count'
func_bar1(x, y, title, 'Dog Breed', 'Average Favorite Count', 0.32, 110, 12, avg_fc_db, 'General Average Favorite Count')
# Let's get the "Saluki" average favorite count
sa_avg_fc = top10_breed_fc[top10_breed_fc['dog_breed']=="Saluki"].favorite_count
# Let's calculate how much higher is its average compared to the general mean
(sa_avg_fc - avg_fc_db) / avg_fc_db * 100
In the Bar Chart above we can see that, within the WeRateDogs rating Tweets, Saluki (according to the Algorithm's #1 prediction) is the dog breed with the highest average favorite count, 189% higher than the general mean.
# Let's get the dog stage's general average retweet count
mean = df.groupby('dog_stage').retweet_count.mean().mean()
mean
# Let's get the dog stages' average retweet count
table = df.groupby('dog_stage').retweet_count.mean().sort_values(ascending=False).reset_index()
table
# Let's plot our bar chart
x = table.dog_stage
y = table.retweet_count
title = 'Bar Chart of Dog Stages\' Average Retweet Count'
func_bar1(x, y, title, 'Dog Stage', 'Average Retweet Count', 0.21, 110, 13, mean, 'General Average Retweet Count')
# Let's get the "doggo,puppo" average retweet count
top_rc = table[table['dog_stage']=="doggo,puppo"].retweet_count
# Let's calculate how much higher is its average compared to the general mean
(top_rc - mean) / mean * 100
In the Bar Chart above we can see that, within the WeRateDogs rating Tweets, the combination of doggo and puppo is the dog stage with the highest average retweet count, 198% higher than the general mean.
# Let's get the dog stage's general average favorite count
mean = df.groupby('dog_stage').favorite_count.mean().mean()
mean
# Let's get the dog stages' average favorite count
table = df.groupby('dog_stage').favorite_count.mean().sort_values(ascending=False).reset_index()
table
# Let's plot our bar chart
x = table.dog_stage
y = table.favorite_count
title = 'Bar Chart of Dog Stages\' Average Favorite Count'
func_bar1(x, y, title, 'Dog Stage', 'Average Favorite Count', 0.25, 150, 13, mean, 'General Average Favorite Count')
# Let's get the "doggo,puppo" average favorite count
top_fc = table[table['dog_stage']=="doggo,puppo"].favorite_count
# Let's calculate how much higher is its average compared to the general mean
(top_fc - mean) / mean * 100
In the Bar Chart above we can see that, within the WeRateDogs rating Tweets, the combination of doggo and puppo is the dog stage with the highest average favorite count, 144% higher than the general mean.
Since "dog_name" column has too many unique levels, let's just display the Top 10 dog names in terms of highest average retweet count.
# Let's get the dog name's general average retweet count
mean = df.groupby('dog_name').retweet_count.mean().mean()
mean
# Let's get the Top 10 dog names in terms of highest average retweet count
top10 = df.groupby('dog_name').retweet_count.mean().sort_values(ascending=False).head(10).reset_index()
top10
# Let's plot our bar chart
x = top10.dog_name
y = top10.retweet_count
title = 'Bar Chart of Top 10 Dog Names in terms of Highest Average Retweet Count'
func_bar1(x, y, title, 'Dog Name', 'Average Retweet Count', 0.32, 220, 8, mean, 'General Average Retweet Count')
# Let's get the "Stephan" average retweet count
top_rc = top10[top10['dog_name']=="Stephan"].retweet_count
# Let's calculate how much higher is its average compared to the general mean
(top_rc - mean) / mean
In the Bar Chart above we can see that, within the WeRateDogs rating Tweets, Stephan is the dog name with the highest average retweet count, 22 times higher than the general mean.
Since "dog_name" column has too many unique levels, let's just display the Top 10 dog names in terms of highest average favorite count.
# Let's get the dog name's general average favorite count
mean = df.groupby('dog_name').favorite_count.mean().mean()
mean
# Let's get the Top 10 dog names in terms of highest average favorite count
top10 = df.groupby('dog_name').favorite_count.mean().sort_values(ascending=False).head(10).reset_index()
top10
# Let's plot our bar chart
x = top10.dog_name
y = top10.favorite_count
title = 'Bar Chart of Top 10 Dog Names in terms of Highest Average Favorite Count'
func_bar1(x, y, title, 'Dog Name', 'Average Favorite Count', 0.32, 220, 7, mean, 'General Average Favorite Count')
# Let's get the "Stephan" average favorite count
top_fc = top10[top10['dog_name']=="Stephan"].favorite_count
# Let's calculate how much higher is its average compared to the general mean
(top_fc - mean) / mean
In the Bar Chart above we can see that, within the WeRateDogs rating Tweets, Stephan is the dog name with the highest average favorite count, 12 times higher than the general mean.
# Let's get the day of the week's general average retweet count
mean = df.groupby('day_of_week').retweet_count.mean().mean()
mean
# Let's get the day of the week's average retweet count
table = df.groupby('day_of_week').retweet_count.mean().sort_values(ascending=False).reset_index()
table
# Let's plot our bar chart
x = table.day_of_week
y = table.retweet_count
title = 'Bar Chart of Day of the Week\'s Average Retweet Count'
func_bar1(x, y, title, 'Day of the Week', 'Average Retweet Count', 0.18, 20, 9, mean, 'General Average Retweet Count')
# Let's get the "Saturday" average retweet count
top_rc = table[table['day_of_week']=="Saturday"].retweet_count
# Let's calculate how much higher is its average compared to the general mean
(top_rc - mean) / mean * 100
In the Bar Chart above we can see that, within the WeRateDogs rating Tweets, Saturday is the day with the Highest Average Retweet Count, 22% higher than the general mean.
# Let's get the day of the week's general average favorite count
mean = df.groupby('day_of_week').favorite_count.mean().mean()
mean
# Let's get the day of the week's average favorite count
table = df.groupby('day_of_week').favorite_count.mean().sort_values(ascending=False).reset_index()
table
# Let's plot our bar chart
x = table.day_of_week
y = table.favorite_count
title = 'Bar Chart of Day of the Week\'s Average Favorite Count'
func_bar1(x, y, title, 'Day of the Week', 'Average Favorite Count', 0.18, 50, 9, mean, 'General Average Favorite Count')
# Let's get the "Wednesday" average favorite count
top_fc = table[table['day_of_week']=="Wednesday"].favorite_count
# Let's calculate how much higher is its average compared to the general mean
(top_fc - mean) / mean * 100
In the Bar Chart above we can see that, within the WeRateDogs rating Tweets, Wednesday is the day with the Highest Average Favorite Count, 12% higher than the general mean.
# Let's get the general average retweet count for weekend and weekday
mean = df.groupby('weekend_weekday').retweet_count.mean().mean()
mean
# Let's get weekend and weekday's average retweet count
table = df.groupby('weekend_weekday').retweet_count.mean().sort_values(ascending=False).reset_index()
table
# Let's plot our bar chart
x = table.weekend_weekday
y = table.retweet_count
title = 'Bar Chart of Weekend and Weekday\'s Average Retweet Count'
func_bar1(x, y, title, 'Weekend or Weekday', 'Average Retweet Count', 0.05, -40, 9, mean, 'General Average')
# Let's get the "weekend" average retweet count
top = table[table['weekend_weekday']=="weekend"].retweet_count
# Let's calculate how much higher is Weekend's average compared to the general mean
(top - mean) / mean * 100
# Let's get the "weekday" average retweet count (avoid shadowing the built-in "min")
weekday_avg = table[table['weekend_weekday']=="weekday"].retweet_count.values
# Let's calculate how much higher Weekend's average is compared to Weekday's average
(top - weekday_avg) / weekday_avg * 100
In the Bar Chart above we can see that WeRateDogs' weekend Tweets have a higher average retweet count than weekday Tweets, though only 4% above the general mean and 8% above the weekday average.
# Let's get the general average favorite count for weekend and weekday
mean = df.groupby('weekend_weekday').favorite_count.mean().mean()
mean
# Let's get weekend and weekday's average favorite count
table = df.groupby('weekend_weekday').favorite_count.mean().sort_values(ascending=False).reset_index()
table
# Let's plot our bar chart
x = table.weekend_weekday
y = table.favorite_count
title = 'Bar Chart of Weekend and Weekday\'s Average Favorite Count'
func_bar1(x, y, title, 'Weekend or Weekday', 'Average Favorite Count', 0.05, 90, 9, mean, 'General\nAverage')
# Let's get the "weekend" average favorite count
top = table[table['weekend_weekday']=="weekend"].favorite_count
# Let's calculate how much higher is Weekend's average compared to the general mean
(top - mean) / mean * 100
# Let's get the "weekday" average favorite count (avoid shadowing the built-in "min")
weekday_avg = table[table['weekend_weekday']=="weekday"].favorite_count.values
# Let's calculate how much higher Weekend's average is compared to Weekday's average
(top - weekday_avg) / weekday_avg * 100
In the Bar Chart above we can see that WeRateDogs' weekend Tweets have a higher average favorite count than weekday Tweets, though only 0.8% above the general mean and 1.6% above the weekday average.
We will first separate the Tweets with the highest favorite count. In order to do that, for our minimum cut we will use the median instead of the mean, as there are too many outliers with high values.
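To illustrate why the median is the safer cut here, consider an invented, heavily skewed set of favorite counts: the single viral value drags the mean far above the typical Tweet, while the median stays put.

```python
import pandas as pd

# Invented favorite counts: four modest Tweets plus one viral outlier
fav = pd.Series([100, 120, 150, 200, 90000])

print(fav.mean())    # pulled way up by the outlier
print(fav.median())  # robust to it
```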
# Let's create a table of Tweets with higher favorite count
high_fav = df[df['favorite_count'] >= df['favorite_count'].median()]
# Let's check the filter worked. This should be 4130
(high_fav['favorite_count']).min()
# Let's define a scatter matrix function
def func_scatter_matrix(X, title):
    axes = pd.plotting.scatter_matrix(X, alpha=0.9, figsize=(16, 16), diagonal='kde', color='lightblue')
    plt.suptitle(title, y=0.9, fontsize=15, fontweight="bold")
    # Annotate each upper-triangle panel with its correlation coefficient
    corr = X.corr().values
    for i, v in zip(*np.triu_indices_from(axes, k=1)):
        axes[i, v].annotate("r: %.4f" % corr[i, v], (0.7, 0.8), xycoords='axes fraction', ha='center', va='center', color='black')
    # Shrink axis labels when there are many panels
    if len(X.columns) > 7:
        [plt.setp(item.yaxis.get_label(), 'size', 9) for item in axes.ravel()]
        [plt.setp(item.xaxis.get_label(), 'size', 9) for item in axes.ravel()]
    else:
        [plt.setp(item.yaxis.get_label(), 'size', 10) for item in axes.ravel()]
        [plt.setp(item.xaxis.get_label(), 'size', 10) for item in axes.ravel()]
    plt.show()
# Let's create a scatter matrix to analyze the quantitative variables of these higher favorite count Tweets
func_scatter_matrix(high_fav, 'Scatter Matrix of Higher Favorite Count Tweets (with outliers)')
In this scatter matrix, we can spot a strong linear positive correlation between the following variables:
Engagement is calculated as the sum of Favorite count and Retweet count, so the strong correlation between them was somewhat expected.
We then proceed to analyze the data without the outliers to check if it affects the results.
# Let's calculate the IQR
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
# Let's check the df shape
df.shape
# Let's select only the quantitative columns
num_col = ['favorite_count', 'retweet_count', 'img_num', 'prediction_confidence', 'rating_numerator', 'rating_denominator', 'tweet_length', 'rating', 'engagement']
# Let's create a table for Tweets without the outliers
df_wo_out = df[~((df[num_col] < (Q1 - 1.5 * IQR)) | (df[num_col] > (Q3 + 1.5 * IQR))).any(axis=1)]
# Let's check the table was created successfully
df_wo_out.shape
Note that 372 rows considered outliers were dropped.
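The filter above applies the usual 1.5×IQR rule: a row is dropped if any of its numeric values falls outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A self-contained sketch of the same rule on one invented column:

```python
import pandas as pd

# Invented numeric column with one obvious outlier
s = pd.Series([1, 2, 3, 4, 5, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
keep = (s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)  # True for non-outliers
print(s[keep])
```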
# Let's select only the highest favorite count Tweets
high_fav1 = df_wo_out[df_wo_out['favorite_count'] >= df['favorite_count'].median()]
# Let's check the filter worked. This should be 4130
(high_fav1['favorite_count']).min()
Note that we still used the median from the "df" table as our minimum cut, in order to make both scatter matrices comparable.
# Let's check the unique values in "high_fav1" table
high_fav1.nunique()
# Let's create a scatter matrix to analyze the quantitative variables
# of these higher favorite count Tweets (without outliers),
# except the "img_num" and "rating_denominator" columns, to avoid single-value columns
num_col = ['favorite_count', 'retweet_count', 'prediction_confidence', 'rating_numerator', 'tweet_length', 'rating', 'engagement']
func_scatter_matrix(high_fav1[num_col], 'Scatter Matrix of Higher Favorite Count Tweets (without outliers)')
In this new scatter matrix, we can spot some strong linear correlations between the following variables:
Rating is calculated by dividing Rating's numerator with Rating's denominator, so the strong correlation between them was expected. Likewise, Engagement is calculated as the sum of Favorite count and Retweet count, so the strong correlation between them was also expected.
We notice that, with or without outliers, there is a strong linear correlation between retweet count and favorite count among the higher favorite count Tweets.
18.1 Is retweet count associated with higher favorite count Tweets?
There is a strong linear correlation between retweet count and higher favorite count.
18.2 Is rating's numerator associated with higher favorite count Tweets?
There is not a linear correlation between rating's numerator and higher favorite count.
18.3 Is rating's denominator associated with higher favorite count Tweets?
There is not a linear correlation between rating's denominator and higher favorite count.
18.4 Is rating associated with higher favorite count Tweets?
There is not a linear correlation between rating and higher favorite count.
18.5 Is Tweet length associated with higher favorite count Tweets?
There is not a linear correlation between Tweet's text length (without URL) and higher favorite count.
We will first separate the Tweets with the highest retweet count. In order to do that, for our minimum cut we will use the median instead of the mean, as there are too many outliers with high values.
# Let's create a table of Tweets with higher retweet count
high_ret = df[df['retweet_count'] >= df['retweet_count'].median()]
# Let's check the filter worked. This should be 1278
(high_ret['retweet_count']).min()
# Let's create a scatter matrix to analyze the quantitative variables of these higher retweet count Tweets
func_scatter_matrix(high_ret, 'Scatter Matrix of Higher Retweet Count Tweets (with outliers)')
In this scatter matrix, we can spot a strong linear positive correlation between the following variables:
Engagement is calculated as the sum of Favorite count and Retweet count, so the strong correlation between them was somewhat expected.
We then proceed to analyze the data without the outliers to check if it affects the results.
# Let's select only the highest retweet count Tweets
high_ret1 = df_wo_out[df_wo_out['retweet_count'] >= df['retweet_count'].median()]
# Let's check the filter worked. This should be 1278
(high_ret1['retweet_count']).min()
Note that we still used the median from the "df" table as our minimum cut, in order to make both scatter matrices comparable.
# Let's check the unique values in "high_ret1" table
high_ret1.nunique()
# Let's create a scatter matrix to analyze the quantitative variables
# of these higher retweet count Tweets (without outliers),
# except the "img_num" and "rating_denominator" columns, to avoid single-value columns
num_col = ['favorite_count', 'retweet_count', 'prediction_confidence', 'rating_numerator', 'tweet_length', 'rating', 'engagement']
func_scatter_matrix(high_ret1[num_col], 'Scatter Matrix of Higher Retweet Count Tweets (without outliers)')
In this new scatter matrix, we can spot some strong linear correlations between the following variables:
Rating is calculated by dividing Rating's numerator with Rating's denominator, so the strong correlation between them was expected. Likewise, Engagement is calculated as the sum of Favorite count and Retweet count, so the strong correlation between them was also expected.
We notice that, with or without outliers, there is a strong linear correlation between favorite count and retweet count among the higher retweet count Tweets.
19.1 Is favorite count associated with higher retweet count Tweets?
There is a strong linear correlation between favorite count and higher retweet count.
19.2 Is rating's numerator associated with higher retweet count Tweets?
There is not a linear correlation between rating's numerator and higher retweet count.
19.3 Is rating's denominator associated with higher retweet count Tweets?
There is not a linear correlation between rating's denominator and higher retweet count.
19.4 Is rating associated with higher retweet count Tweets?
There is not a linear correlation between rating and higher retweet count.
19.5 Is Tweet length associated with higher retweet count Tweets?
There is not a linear correlation between Tweet's text length (without URL) and higher retweet count.
For our clean dataset (WeRateDogs rating Tweets of a single dog, posted between November 15, 2015 and August 1, 2017), we can conclude the following:
Limitations of this Project:
- tweet_predictions data only included the breed prediction for one dog, so Tweets with more than one dog name or rating were dropped when found during the cleaning process.
- The tweets_info table ended up having only 2331 rows, instead of the 2356 rows of the twitter_archive table.
- The tweet_predictions table was missing 16 Tweets present in tweets_info.

Other considerations:
- twitter_archive table.
- twitter_archive, tweets_info and tweet_predictions should be a single table, because they refer to the same observational unit: the Tweet, whether factual information or a prediction made by an algorithm.

References: